4 The data file and the estimation samples

The data file is Statistics Canada's Longitudinal Administrative Data base (LAD). It is a 10% representative sample of all Canadian income tax filers drawn from Canada Revenue Agency's T-1 income tax files, containing over 1.5 million records per year. The measure of earnings used in the paper is total annual wage and salary income (henceforth 'earnings'), as reported on individuals' tax forms.

The estimation samples used in this analysis include all paid workers aged from 20 to 64 who were not full-time students during the tax year, who received at least $1,000 (in 1997 constant dollars) of wage and salary income, whose earnings exceeded any net (declared) self-employment income and who reported at least two years of above-minimum earnings (as just defined) on the LAD file. These restrictions are intended to approximate Statistics Canada's concept of 'all paid workers,' while excluding those with only limited attachment to the labour market.3 Most of the exclusions involve workers over age 64, the self-employed (most of whom had very low labour market earnings) and non-continuous participants in the labour market. Further details regarding the data file, including the coverage of the LAD, its degree of representativeness of the general population, the number of records in the full LAD file and the effects of the specific sampling exclusion criteria, are contained in the Appendix of Beach, Finnie and Gray (2001).
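The selection rules above can be illustrated with a minimal sketch. The record layout (PersonYear and its field names) is hypothetical and is not the LAD file's actual structure; the sketch also assumes one reading of the final criterion, namely that a person needs at least two years that pass all of the annual screens.

```python
from collections import Counter
from dataclasses import dataclass

MIN_EARNINGS_1997 = 1_000  # earnings floor, in 1997 constant dollars


@dataclass
class PersonYear:
    """One hypothetical person-year record (field names are illustrative)."""
    person_id: int
    year: int
    age: int
    full_time_student: bool
    earnings_1997: float      # wage and salary income, 1997 constant dollars
    net_self_emp_1997: float  # net declared self-employment income, same dollars


def passes_annual_screen(r: PersonYear) -> bool:
    """Apply the per-year criteria: age, student status, floor, self-employment."""
    return (
        20 <= r.age <= 64
        and not r.full_time_student
        and r.earnings_1997 >= MIN_EARNINGS_1997
        and r.earnings_1997 > r.net_self_emp_1997
    )


def select_sample(records: list[PersonYear]) -> list[PersonYear]:
    """Keep qualifying person-years of people with at least two qualifying years.

    Assumption: 'at least two years of above-minimum earnings' is read here as
    two or more years passing the full annual screen.
    """
    qualifying = [r for r in records if passes_annual_screen(r)]
    years_per_person = Counter(r.person_id for r in qualifying)
    return [r for r in qualifying if years_per_person[r.person_id] >= 2]
```

Under this reading, a person with one year below the $1,000 floor and one year above it would be dropped, since only a single year qualifies.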

The period covered by the study is from 1982 to 2000. To capture inter-temporal changes in the variance components over this period on a continual basis, we face a trade-off between the length of the window over which the variance components are calculated (i.e., max(Ti) in the Section 3 presentation) and the frequency of the observations generated from those intervals. The longer the window, the more degrees of freedom there are to identify deviations from the mean, and the better the mean represents long-term earnings; but the fewer independent observations there are over the entire interval, and hence the fewer values available for producing time-series graphs and carrying out regression analysis. We choose a window length of five years: it is long enough to distinguish 'permanent' or long-run earnings inequality from short-run or 'transitory' earnings instability, yet short enough to generate a sufficient number of time-series points to allow reasonable statistical analysis of the effects of macroeconomic variables. As we seek to generate point estimates at an annual frequency, overlapping (as opposed to disjoint) windows are employed.

The entire 19-year estimation interval is divided into 15 consecutive rolling sampling windows of equal 5-year length, each involving a fixed and balanced sample of workers whose earnings are positive for 5 consecutive years. The initial sample, for instance, comprises all individuals who reported positive earnings for each of the years from 1982 to 1986. The second sample comprises all individuals who reported positive earnings for each of the years from 1983 to 1987, and the 15th and final sample comprises all individuals who reported positive earnings for each of the years from 1996 to 2000. For each such 5-year sample, the three variance measures (from Equations (1), (2) and (3) of the previous section) are calculated: hence, the horizontal axis indicators (8286, 8387, ..., 9600) in Figures 1, 2 and 3. By construction, any two adjacent samples share four years of data, any two samples that commence two years apart share three years of data, and any two samples that commence five or more years apart share no years of data.4 The statistics generated from these rolling samples, of which there are 15 annual observations, are analogous to a moving average process over five consecutive years. Statistics calculated from samples that start one or two years apart are obviously highly correlated; only when five or more years separate the start dates will the calculated values be totally independent. Nevertheless, it turns out that distinctive turning points can be discerned over the global interval from 1982 to 2000.
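The window arithmetic can be checked with a short sketch. The function names below are illustrative and do not come from the paper; the sketch only enumerates the calendar windows and their overlaps, not the individuals in each sample.

```python
def rolling_windows(first_year=1982, last_year=2000, length=5):
    """Return (start, end) year pairs for every full window in the interval."""
    last_start = last_year - length + 1
    return [(s, s + length - 1) for s in range(first_year, last_start + 1)]


def axis_label(window):
    """Build the two-digit start/end label used on the figures' horizontal axes."""
    start, end = window
    return f"{start % 100:02d}{end % 100:02d}"


def shared_years(a, b):
    """Count the calendar years that two windows have in common."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    return max(0, overlap)


windows = rolling_windows()
labels = [axis_label(w) for w in windows]
```

With the default arguments this yields 15 windows running from (1982, 1986) to (1996, 2000), and shared_years returns 4 for adjacent windows and 0 for windows whose start dates are five or more years apart.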

The estimation samples of this paper are also broken down by age and gender. The four age groups are 'Entry' (from 20 to 24), 'Younger' (from 25 to 34), 'Prime' (from 35 to 54) and 'Older' (from 55 to 64) for both women and men. This allows us to examine earnings variability patterns over different phases of workers' life-cycles. The full set of sample sizes for the 120 samples (4 age groups for each gender over 15 cohorts) is provided in Appendix Table A.1. The samples range in size from 31,500 to 489,000 data points, and they reflect the demographic shifts and labour-market participation trends that occurred over this period. In particular, over the course of the period, the number of younger workers diminishes and the number of women in the labour market increases. These patterns also reflect individuals' movements across age groups over the relevant sample period. For example, individuals exit the 'Entry' age group and enter the 'Younger' group as they age, and a similar dynamic operates across the entire age spectrum.
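As a small illustration of this breakdown, the sketch below maps an age to its group and enumerates the resulting cells. The names are hypothetical, and the text does not say at which year within a window a worker's age is measured for this purpose, so the sketch leaves that choice to the caller.

```python
# Age bounds (inclusive) for the four groups named in the text.
AGE_GROUPS = {
    "Entry": (20, 24),
    "Younger": (25, 34),
    "Prime": (35, 54),
    "Older": (55, 64),
}
GENDERS = ("women", "men")
N_WINDOWS = 15


def age_group(age):
    """Return the group name for an age, or None if outside 20 to 64."""
    for name, (low, high) in AGE_GROUPS.items():
        if low <= age <= high:
            return name
    return None


# One cell per (age group, gender, window): 4 x 2 x 15 estimation samples.
cells = [
    (group, gender, window)
    for window in range(N_WINDOWS)
    for gender in GENDERS
    for group in AGE_GROUPS
]
```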

For the graphical as well as the regression analysis, we first estimate life-cycle adjusted earnings profiles based on log-earnings regressions. As mentioned above, the dependent variable is yit, the log earnings for an individual in a given year, and the independent variables consist of a quartic in age for each of the male and the female estimation samples. For these (log) earnings equations, the four age groups are pooled together for a given gender. These regressions are estimated separately for each estimation window. This results in 30 such (log) earnings regressions: a male and a female regression for each of the 15 window samples. Results from these earnings equations are presented in Appendix Table A.2, and they indicate a statistically significant and strong positive (negative) effect associated with age (age squared), which is consistent with the broad earnings literature.
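Written out, a log-earnings equation of this kind might take the following form. Only yit comes from the text; the age variable, the coefficient symbols, the error term and the g (gender) and w (window) indices are notation assumed here for illustration, and the paper's own specification may include further terms such as year effects.

```latex
% Illustrative notation: a_{it} is the age of individual i in year t;
% g indexes gender and w indexes the 5-year window, so each of the
% 2 x 15 = 30 regressions has its own coefficient vector.
y_{it} = \beta_{0}^{gw} + \beta_{1}^{gw} a_{it} + \beta_{2}^{gw} a_{it}^{2}
       + \beta_{3}^{gw} a_{it}^{3} + \beta_{4}^{gw} a_{it}^{4} + u_{it}
```

In this notation, the reported pattern corresponds to a positive estimate of the coefficient on age and a negative estimate of the coefficient on age squared.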

3 When compiling the Longitudinal Administrative Data base file, special procedures are employed in order to deal with individuals who have changed their SINs (social insurance numbers that serve as our identifier), who have multiple SINs and other non-standard cases (see Finnie 1997), which comprise on the order of 4% of the file in any given year. Full-time students are identified from tuition and education tax credit responses on T-1 forms.

4 Note that no two samples will be composed of the exact same individuals. As one moves from one sample to another with the passage of time, some new individuals will enter the sample as they meet our overall sampling criteria, and some individuals will leave the sample as they no longer meet these criteria.