Longitudinal Administrative Data Dictionary, 2021

Release date: November 10, 2023

1 Introduction

The Longitudinal Administrative Databank (LAD) is a subset of the T1 Family File (T1FF). The T1FF is a yearly cross-sectional file of all taxfilers and their families. Census families are created from information provided annually to the Canada Revenue Agency in personal income tax returns. Both legal and common law spouses are attached by the spousal Social Insurance Number (SIN) listed on the tax form, or by matching based on name, address, age, sex, and marital status. Children are identified through a similar algorithm and supplementary files. Prior to 1993, non-filing children were identified from information on their parents’ tax form. Information from the Family Allowance Program was used to assist in the identification of children. Since 1993, information from the Child Tax Benefit Program has been used for this purpose.

The LAD is a random 20% sample of the T1FF. Selection for the LAD is based on an individual’s SIN. There is no age restriction, but people without a SIN can only be included in the family component. Once a person is selected for the LAD, the individual remains in the sample and is picked up each year from the T1FF if he or she appears on the T1 that year. Individuals selected for the LAD are linked across years by a unique LAD identification number (LIN__I) generated from the SIN, to create a longitudinal profile of each individual. The LAD is augmented each year with a sample of new taxfilers so that it consists of approximately 20% of taxfilers for every year. The 20% sample has grown over the years: 3.2 million people in 1982, 4.05 million in 1992, 4.7 million in 2002 and 5.3 million in 2012. This growth reflects increases in the Canadian population and in the incidence of tax filing as a result of the introduction of the Federal sales tax credit in 1986 and the Goods and Services Tax credit in 1989.

The LAD is organized into four levels of aggregation, namely the individual, spouse/parent, family, and child levels. The databank contains information on demographics, income, and other taxation data at the different levels of aggregation from 1982 to 2021, with new years of data being added as the information becomes available. Changes in tax legislation and in the design of the T1 form itself have resulted in some variables not being available for all years, as well as some minor definitional changes from one year to the next.

The LAD also obtains information through microdata linkages to other administrative data sources including Tax Free Savings Account (TFSA) information, private corporation ownership information from Schedule 50 of the T2 tax form, and immigration information from the Landing file administrative data. In addition, a linking key resides on the Longitudinal Immigration Database (IMDB) – a database containing immigration records from 1980 to present – which allows for research to be conducted using a linked IMDB-LAD database. All microdata linkages have been approved by the relevant Statistics Canada management and privacy bodies. Further information is available at http://www.statcan.gc.ca.

The LAD has been designed to serve as a research tool from which custom tabulations can be prepared. This dictionary, in turn, has been created to assist researchers in identifying the type of information that is available from the LAD. It identifies and defines the LAD variables including historical changes.

2 Confidentiality

Statistics Canada protects the confidentiality of individuals’ tax data. Only aggregated information that conforms to the confidentiality provision of the Statistics Act is released. The LAD resides within Statistics Canada and all retrievals are done on site. Only employees of Statistics Canada can access such data directly. More information on the confidentiality procedures can be obtained from Client Services.

3 Geography

Data from the LAD are available for various levels of geography, including Canada, provinces/territories, and regions such as Census Division (CD), Census Metropolitan Area/Census Agglomeration (CMA/CA), Census Subdivision (CSD) and Census Tract (CT). Many other levels of geography, for example Economic Region (ER) and Federal Electoral District (FED), are not included on the main LAD database; however, these may be available in the LAD using the Postal Code Conversion File. Note that geography classifications on the LAD are based on converting postal code areas to other geographic boundaries.

4 Dictionary format and contents

Outlined below is a brief description of the next eight sections of the LAD Dictionary.

The LAD register (Section 5) is a file that is used in conjunction with the yearly LAD files. The Register outlines the years that an individual is on the LAD and provides information on the taxfiler’s sex, year of birth, and year of death. This section provides a brief description of this file and describes how it can be used to enhance LAD data analysis.

The Programming tips section (Section 6) provides information on writing programs for LAD retrievals. This information will assist those who want to access data from the LAD files using an effective programming structure.

The Design of LAD variable acronyms (Section 7) is a description of the variable acronym structure. It provides insight into how to interpret the variable acronyms and information on the aggregation levels.

The What’s New section (Section 8) is a description of changes to the LAD database since the previous LAD release. It also provides a list of the new variables added to the LAD for the present income year. These new variables may also be available for previous years. Users are encouraged to check each new variable to determine the years available.

The LAD variable definitions section (Section 9) typologically lists each variable by name. In addition, the following information is provided for each variable:

The Variable counts and amounts for individuals section (Section 10) outlines, for many variables at the individual level of aggregation, the count of individuals and the dollar amounts reported for the two most recent years of LAD data. Persons included in these counts and amounts are those who have been selected into the LAD sample.

The Definition of total income variables (Section 11) identifies and defines total income variables and highlights historical changes. Also provided are tables that outline and compare the variables that comprise market income and the Canada Revenue Agency’s (CRA) and Income Statistics Division’s (ISD) definitions of total income.

The tables outlined in this section are the following:

Finally, How to obtain more information on the inside cover provides information on how to contact us by telephone, mail, fax, or e-mail from across Canada.

5 LAD register

The LAD register is a companion data file to the yearly LAD files. It contains a selected number of variables for all individuals who are present at any time in the LAD. These variables have characteristics that should remain constant over time and thus may not be identified in a particular yearly file. A new LAD register is created every year with the addition of a new LAD yearly file from taxfiler information provided from living or deceased taxfilers and imputed individuals. Thus, the current register contains the most up-to-date information on individuals present in the LAD. On rare occasions, new information on individuals may differ from that on the existing file. In these instances, current information supersedes information in the existing LAD register.

The LAD register is a quick reference tool that can provide basic data without accessing the yearly files. For example, information such as the number of individuals in the LAD by age and sex in a given year can be tabulated directly from the register. Further, the LAD register can be employed in conjunction with the yearly files.
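As a rough sketch of the kind of tabulation described above, the following SAS steps count individuals present in a given year by sex and age group using register variables only. The dataset reg_demo, its values, the age groupings and the choice of 2002 are invented for illustration; only the names LIN__I, SXCO_I, YOB__I, flag_i2002 and WGT2_I are taken from this dictionary, and no particular coding of SXCO_I is assumed.

```sas
* Toy stand-in for the register with invented values ;
data reg_demo;
   input LIN__I SXCO_I YOB__I flag_i2002 WGT2_I;
   datalines;
1 1 1950 1 1.0
2 2 1975 1 1.0
3 1 1990 0 1.0
4 2 1940 1 1.0
;
run;

* Keep individuals flagged as present in 2002 and derive an age group ;
data reg_2002;
   set reg_demo;
   if flag_i2002 = 1;
   age = 2002 - YOB__I;
   length agegrp $5;
   * A missing age sorts below any number in SAS so it is handled first ;
   if missing(age) then agegrp = 'unk';
   else if age < 25 then agegrp = '0-24';
   else if age < 65 then agegrp = '25-64';
   else agegrp = '65+';
run;

* Weighted two-way table of sex by age group ;
proc freq data=reg_2002;
   tables SXCO_I*agegrp / missing;
   weight WGT2_I;
run;
```

The WEIGHT statement follows the practice of the sample program in Section 6; any output intended for release would still need to meet the confidentiality rules.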

Following is a list of the variables that can be found on the register:

6 LAD structure and programming tips

This section provides information on the structure of the LAD, along with programming information for individuals who want a better understanding of the programming structure used to access data from the LAD files. Please note that individuals may write their own programs; however, only a small staff within Statistics Canada can carry out these retrievals. Access to the LAD files is restricted to protect the confidentiality of individuals’ tax data, and any data that are made available will be screened through a set of rules designed to prevent disclosure.

LAD structure

There are two types of LAD files: the yearly LAD data files and the LAD register (for more details on the LAD register, refer to Section 5, LAD register). LAD variables are identified with a variable name that consists of three parts: 1) the acronym name, 2) the aggregate level, and 3) the year (the four-digit year extension exists in most, but not all, cases). Observations in the LAD files are sorted by the variable LIN__I (note that there is no year extension for this variable), which enables users to maintain a link across years.

The LAD is based on individual selection. Each year a random sample is taken of the “new” filers, based on the first time an individual files their taxes. Each tax filer gets one chance at entry to the LAD. People who were filing taxes in 1982 had their one chance at that point, while people who filed for the first time in 1983 were sampled at that point, and then those who filed for the first time in 1984 were sampled, etc.

A further important structural aspect unique to the LAD is the weighting process. The LAD weighting is designed to ensure confidentiality and valid estimates of the tax filing population. Data users must apply the population and perturbation weights in any estimates they wish to release.

Since the LAD is a 20% random sample of those with a valid SIN from the T1FF, and the T1FF represents the universe of tax filers in a particular year, users must multiply LAD sample estimates (such as counts) by 5 to produce population estimates of the overall tax filing population.

LAD users must also apply the perturbation weight variable (WGT2_I) to their estimates. The perturbation weight variable was designed to ensure confidentiality of estimates which will be released to the public, particularly for those estimates produced from a smaller number of observations. When applied, variable WGT2_I should have a relatively small effect on estimates which use a large number of LAD observations, and a likely larger effect on estimates using fewer observations – the mean effect of the perturbation weight is approximately one (1).

One way to apply this LAD weighting consistently is to multiply the perturbation weight (WGT2_I) by 5 and use the product as the weight when obtaining LAD estimates. In this way, users can obtain a population estimate (multiplying by 5 because of the 20% random sample) and ensure confidentiality (through the perturbation weight).
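The weighting approach for individual estimates can be sketched in SAS as follows. This is an illustration only: the dataset lad_demo, its values and the derived variable name popwgt are hypothetical, while WGT2_I and PRCO_I2002 are names that appear in this dictionary.

```sas
* Toy extract with invented values standing in for merged LAD data ;
data lad_demo;
   input LIN__I PRCO_I2002 WGT2_I;
   datalines;
1 5 0.98
2 5 1.03
3 4 1.00
;
run;

* Combine the perturbation weight with the factor of 5 for the 20 percent sample ;
data lad_wgt;
   set lad_demo;
   popwgt = 5 * WGT2_I;
run;

* Weighted counts by province code estimate the tax filing population ;
proc freq data=lad_wgt;
   tables PRCO_I2002 / missing;
   weight popwgt;
run;
```

With these toy values, the weighted count for province code 5 is 5 × (0.98 + 1.03) = 10.05 rather than the unweighted sample count of 2.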

While the LAD is based on individuals, it is possible in each year to identify those LAD individuals who are in a Census family, along with some information about the members of their Census family. A family weight (FAMWGT) is available on the LAD to produce cross-sectional estimates of the population of tax filing families for LAD variables whose aggregate level character is ‘F’ (for example, XTIRCF). Unlike individual estimates, census family estimates only require applying the family weight variable (FAMWGT) to obtain population estimates; users do not need to apply the perturbation weight (WGT2_I) or multiply by 5. As well, only one record from each family must be used: users must ensure that for each FIN__I there is only one associated LAD record (one LIN__I), because the LAD is a random sample and more than one member of the same family may appear on it. Estimates produced with this weight are comparable to those from the T1 Family File (T1FF).
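A minimal sketch of a family-level estimate under these rules might look like the following. The dataset fam_demo and its values are invented, and the family identifier, family weight and XTIRCF are written without a year extension, which is an assumption made for brevity rather than a statement about the actual file layout.

```sas
* Toy family-level extract with invented values ;
* Individuals 1 and 2 belong to the same family ;
data fam_demo;
   input LIN__I FIN__I FAMWGT XTIRCF;
   datalines;
1 100 5.1 60000
2 100 5.1 60000
3 200 4.9 35000
;
run;

* Keep a single LAD record per family identifier ;
proc sort data=fam_demo out=fam_one nodupkey;
   by FIN__I;
run;

* Apply the family weight only with no perturbation weight and no factor of 5 ;
proc means data=fam_one sumwgt mean;
   var XTIRCF;
   weight FAMWGT;
run;
```

Here SUMWGT gives the estimated number of families and MEAN the weighted average family income. Which record is kept for a family depends on the input order, which does not matter for ‘F’ level variables if they carry the same value for every member of the family.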

For longitudinal analysis, the LAD only tracks individuals in the LAD sample over time. A researcher who wishes to track census families over time may have some limited success in doing so, but the LAD is not designed to track families. Because the LAD only tracks individuals, if a marriage or relationship dissolves, the spouse or partner information of the LAD tax filer will not be found in the LAD years following that dissolution, except in the unlikely case that both individuals happen to be selected into the LAD.

As mentioned, the variables on the LAD are available in many cases at one or more different levels of aggregation (see Section 7 for further information). There may be instances where the results users obtain seem confusing when compared across different levels of aggregation. Researchers examining the individual “_I” and parents “_P” levels of aggregation may expect similar results at these two levels. However, this need not be the case. A disconnect may occur when the selected LAD individual tax filer is a child. In that case, all the income information at the “I” level relates to the child, whereas the income information at the “P” level relates to the parents of that child, so the “I” level and “P” level incomes will not necessarily match.

In such cases, researchers may wish to use the LAD variable INDFL to examine whether the individual filer is a child. Users should remember that there is no restriction on the ages of children. A child is defined as anyone who is single and living with one or both parents. For example, a 50-year-old child may be living with a 70-year-old parent. This family would be classified as a lone-parent family.

LAD programming information

Data access is undertaken with the SAS programming language. A sample SAS program designed to access LAD data follows this discussion. The library assignments on its first three lines give the locations of the input files (first two lines) and the output files (third line). The input files are in SAS format and can therefore be accessed with a SET or MERGE statement. The program uses the 20% sample to retrieve the number of Social Assistance (SA) recipients in Ontario who did not have any earnings appearing on their T4 slips, by sex and year (in this case, 2000 to 2002).

It is generally recommended that programs use the variables available in the register rather than the yearly files, because the register contains the most recent data. For example, the sample program uses SXCO_I, a variable found in the register, rather than SXCO_I&yr, the variable found in the yearly LAD files. The flag_i&yr variables in the register are useful for identifying individuals who have filed in a given year. In this program, only individuals who have filed every year from 2000 to 2002 are selected. At the end of the program, four tables are created from the output data file. Note that for confidentiality purposes, the weight variables wgt__i (with the LAD 10% sample) or WGT2_I (with the LAD 20% sample) must be used whenever a SAS procedure such as FREQ or LOGISTIC is invoked.

When programming in SAS, it is important to keep in mind the distinction between missing values and zeros in numeric fields. With SAS, most mathematical operations undertaken with missing values will return missing values. In LAD, in years that an individual is present, numeric variables not relevant to that individual have a value of zero. For example, if a non-family person has filed in 2000, then the value for RRSPSI2000 (contributions to a spouse’s RRSP) should be zero. If that individual has not filed in 2000, then the value will be missing. Thus, as a safety precaution, it is suggested that all numeric variables to be used in mathematical expressions be initialized to zero if missing, before using them.
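The precaution suggested above can be sketched as follows. The dataset miss_demo, its values and the derived names raw_total, safe_total and k are hypothetical; the two income variables are borrowed from the sample program in this section.

```sas
* Toy records with invented values where a period marks a missing numeric ;
data miss_demo;
   input LIN__I T4E__I2000 SASPYI2000;
   datalines;
1 25000 0
2 . .
3 0 4800
;
run;

data miss_fixed;
   set miss_demo;
   * Direct addition returns missing when either operand is missing ;
   raw_total = T4E__I2000 + SASPYI2000;
   * Reset missing values to zero before using them in expressions ;
   array amts {*} T4E__I2000 SASPYI2000;
   do k = 1 to dim(amts);
      if missing(amts{k}) then amts{k} = 0;
   end;
   drop k;
   safe_total = T4E__I2000 + SASPYI2000;
run;

proc print data=miss_fixed;
run;
```

For the second record, raw_total is missing while safe_total is zero. Because zero-filling erases the distinction between a non-filer and a filer who reported nothing, it may be worth keeping the relevant flag_i&yr variable alongside the amounts.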

Sample LAD program

* Sample SAS program using the LAD;

libname source1 '/LADdata/data1';          * first 10% sample ;
libname source2 '/LADdata/data2';          * second 10% sample ;
libname Out '/LADuser/xxxx/data';          * user directory ;

* The objective of this sample program is to use the 20% LAD to retrieve the number of Social Assistance (SA) recipients in Ontario that did not have any earnings appearing on their T4 slips, according to sex and year (in this case, 2000 to 2002). Data for provinces and earnings are from the yearly LAD files whereas the sex variable is from the 2002 LAD register. ;

* The first step is to create a datafile containing all the information that we need to produce our tables. This datafile will be called SAOnt and will be saved in the out directory. The Longitudinal Identifier Number (LIN__I) is used to merge the annual LAD datasets. ;

data out.SAOnt;
   merge
      SOURCE1.LAD2000(where=(PRCO_I2000 = 5) keep=LIN__I PRCO_I2000 SASPYI2000 T4E__I2000)
      SOURCE2.LAD2000(where=(PRCO_I2000 = 5) keep=LIN__I PRCO_I2000 SASPYI2000 T4E__I2000)
      SOURCE1.LAD2001(where=(PRCO_I2001 = 5) keep=LIN__I PRCO_I2001 SASPYI2001 T4E__I2001)
      SOURCE2.LAD2001(where=(PRCO_I2001 = 5) keep=LIN__I PRCO_I2001 SASPYI2001 T4E__I2001)
      SOURCE1.LAD2002(where=(PRCO_I2002 = 5) keep=LIN__I PRCO_I2002 SASPYI2002 T4E__I2002)
      SOURCE2.LAD2002(where=(PRCO_I2002 = 5) keep=LIN__I PRCO_I2002 SASPYI2002 T4E__I2002)
      SOURCE1.REG2002(keep=LIN__I SXCO_I flag_i2000-flag_i2002 WGT2_I)
      SOURCE2.REG2002(keep=LIN__I SXCO_I flag_i2000-flag_i2002 WGT2_I);

by LIN__I ;

If flag_i2000=1 and flag_i2001=1 and flag_i2002=1; *person must be taxfiler in all 3 years;

* We create a flag variable that identifies the SA recipients with no T4 earnings for each year. The result is three variables, flag_sa2000, flag_sa2001 and flag_sa2002, taking a value of either 1 or 0. ;

if (T4E__I2000=0 and SASPYI2000>0) then flag_sa2000 = 1 ;
          else flag_sa2000 = 0 ;
if (T4E__I2001=0 and SASPYI2001>0) then flag_sa2001 = 1 ;
          else flag_sa2001 = 0 ;
if (T4E__I2002=0 and SASPYI2002>0) then flag_sa2002 = 1 ;
          else flag_sa2002 = 0 ;

run ;

* The SAS freq procedure is used to produce our tables. Users must also make sure that the confidentiality guidelines are respected. ;

proc freq data = out.SAOnt;

          tables SXCO_I*flag_sa2000*flag_sa2001*flag_sa2002 /missing;
          weight WGT2_I ;

run ;

* End of the sample program ;

7 Design of LAD variable acronyms

Most LAD variables have a ten-character acronym. Each acronym consists of three parts, namely the variable name (five characters), the aggregate level (one character), and the calendar year (four characters), e.g. XTIRCI2000.

The variable name is the principal component of the acronym. The characters identify the type of information provided by the variable (see section 9 “LAD Variable Definitions”).

The one-character aggregate level provides information on individuals of the census family according to the designated level of aggregation. There are four possibilities, namely ‘I’, ‘P’, ‘F’, and ‘K’, representing individual, parents, family and children (kids) respectively. The family types outlined in these aggregate levels refer to the status of the family at the end of the tax year. Following are details about each of these aggregate levels:

The four characters for the calendar year identify the year with which the variable is associated. The LAD data are stored in separate files for each calendar year; therefore, all variables in a particular year file have the same four-character calendar year reference. The only exception in the yearly files is the variable LIN__I, the LAD individual identification number, which is available for each observation present in each year file but does not have a calendar year as part of the acronym (note that there is also a variable for spousal LIN, LIN__Pyyyy, which does have the year extension as part of the acronym name). In the register file, the exceptions to the four-character year are LIN__I, SXCO_I, YOB__I, YOD__I, LNDYRI, TTNFLI and IMMFLI, which are the individual’s LIN, sex, year of birth, year of death, landing year, temporary SIN flag, and immigrant flag, respectively.
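The three-part structure described in this section can be illustrated by splitting a ten-character acronym by position. The sketch below is a hypothetical helper, not part of the LAD itself: the dataset name acronyms and the derived names varname, level and year are invented, and the approach only applies to acronyms that follow the ten-character pattern with a year extension.

```sas
* Split ten-character acronyms into name then aggregate level then year ;
data acronyms;
   length acronym $10 varname $5 level $1;
   input acronym $;
   varname = substr(acronym, 1, 5);
   level   = substr(acronym, 6, 1);
   year    = input(substr(acronym, 7, 4), 4.);
   datalines;
XTIRCI2000
T4E__I2002
SASPYI2001
;
run;

proc print data=acronyms;
run;
```

For XTIRCI2000, this gives the variable name XTIRC, the aggregate level I and the year 2000.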

8 What’s New – LAD 2021

There have been several changes and improvements to the LAD and to the LAD data dictionary since the release of the 2020 LAD.

Important information has been added to sections 6 and 7 (see above) to assist users of the LAD. Many new users of the LAD may encounter difficulties with the structure of the LAD and its data files. The aim of this new information is to provide some suggestions about how researchers might best interpret LAD results, or how best to apply information on the LAD.

As well, there have been changes to a few existing variables with updates or modifications to their descriptions. There are also some new variables which have been added to the LAD, including new variables in the COVID benefits section.

Modified variables

To assist users of the LAD some changes have been made to the definitions of certain variables. Additional information has been added to the description of the following variables to clarify where COVID benefits have already been applied in the aggregate income concept: Child Tax Benefit (CTBI_), Provincial tax credits (PTXC_), GST and FST credits (GHSTC), and Other Income (OI___). As well, many of the variable definitions in the COVID section have been updated. Users are encouraged to consult the relevant variable definitions in Section 9 below.

New variables

Six new variables have been added to the LAD 2021 database. The first is Return Method (RTNMT_) providing LAD users with more information about the method a tax filer used to submit their taxes, covering the period 2003 to the present. Next is Other refundable credits (OTHRFC_) mainly affecting farming tax filers. Also, information on COVID recovery repayments is available in variable CV19RBRP_, amount of COVID recovery benefit repayment. As well, there are three new provincial COVID benefits variables – two from Ontario and one from British Columbia. The table below lists the variable names and descriptions of the new additions to the 2021 LAD. More complete descriptions can be found in section 9.

New variables available on the LAD as of income year 2021

New variable                                              Years available
Return Method (RTNMT_)                                    2003 to present
Other refundable credits (OTHRFC_)                        2021
Amount of COVID recovery benefit repayment (CV19RBRP_)    2021
Provincial and Territorial COVID emergency and recovery benefits:
  BC Recovery Benefit (CV19BCRB_)                         2021
  Ontario Covid-19 Child Benefit (CV19ONCCB_)             2021
  Ontario Support for Learners (CV19ONSL_)                2021

9 LAD variable definitions

10 Selected income variable counts and medians for individuals, 2020 to 2021

11 Definition of total income variables

This section specifies the exact definitions of the three measures of total income that are available on the LAD: TIRC, XTIRC, and MKINC.

The first measure of total income is TIRC, which is the Canada Revenue Agency Taxation definition of total income as per the T1 form. The second measure, XTIRC, has been derived by the Small Area and Administrative Data Division of Statistics Canada as a more appropriate measure for statistical analysis. The components of income that are included in XTIRC are generally described in Table 1, Components of XTIRC in 2021, while the details are given in Table 5, Definition of XTIRC, 1982 to 2021.

The largest difference between XTIRC and TIRC occurs from 1986 onward because non-Taxable income is added to XTIRC. In 1986, the Government of Canada introduced the Federal Sales Tax (FST) Credit directed at the low-income population. In order to determine eligibility for the FST Credit, filers had to report their non-Taxable income. This was defined as Social Assistance payments, Guaranteed Income Supplement (GIS), Spouse’s Allowance (SPA), and Workers’ compensation payments. As a result of adding non-Taxable income to XTIRC in 1986, the user is cautioned in comparing pre-1986 values of XTIRC with later values. For example, an increase in XTIRC from 1985 to 1986 may simply reflect the reporting of non-Taxable income on the 1986 T1 form but not on the 1985 T1, i.e. perhaps no increase in income occurred.

Other new differences are the exclusion of RRSP income for people who are less than 65 years old and the inclusion of Indian exempt employment income to TIRC.

Another difference between TIRC and XTIRC is that capital gains are included in the former but not in the latter. The remaining differences are detailed in Table 4, Differences between TIRC and XTIRC.

The third measure of total income available from LAD is market income (MKINC). MKINC is derived from XTIRC by removing government transfer payments. The components of MKINC are generally described in Table 2, Components of MKINC, 1982 to 2021, while Table 6, Definition of MKINC, 1982 to 2021, gives the detailed derivation.

Besides the change to XTIRC in 1986 due to the addition of sales tax credits, changes in tax legislation and in the content of the T1 form itself have resulted in differences in the availability of the components of total income. The trend has been towards greater availability. For example, in 1992, the components of non-Taxable income are reported separately on the T1 form, adding three variables to the LAD: NFSL, denoting net federal supplements (GIS and SPA), WKCPY, denoting Workers’ compensation payments, and SASPY denoting social assistance payments. From 1986 to 1991, only the total of these three payments was reported. A history of the changes in XTIRC is given in Table 3, History of Components of XTIRC.

In summary, this part of the LAD Dictionary specifies the components of TIRC, XTIRC, and MKINC for each year of LAD from 1982 to 2021 via:
