The data: the PSIS and the samples employed


In this section, the general characteristics of the PSIS data are described, the samples created for the analysis are defined, and descriptive statistics of these samples are presented. Readers who are less concerned with such details may skim the section, although they should at least gain a basic understanding of the three different samples employed ("Sample 1", and the two variants of "Sample 2"). To this end, the organisation of the section into detailed sub-sections should allow readers to go directly to the parts that interest them most, although those who read the entire section will likely come away with a better understanding of the analysis and its various nuances.

4.1   The PSIS and the Longitudinal "L-PSIS"

The unique opportunities for measuring persistence using the PSIS

The PSIS dataset has been constructed by Statistics Canada from administrative data provided to it by PSE institutions across the country in a standardised format.1 For this study, the data cover (public) PSE institutions (and students) in Atlantic Canada, with the PSIS having been put on a longitudinal footing for this region in order to facilitate an analysis of PSE pathways.

This regional impetus is, in turn, rooted in the on-going general cooperation on matters related to PSE among the Atlantic provinces, and their decision to undertake this particular project jointly in order to obtain as complete a view of PSE persistence pathways as possible, as well as to share the costs of the required data development and analysis and other related practical and analytical reasons.2

The focus on Atlantic Canada also, however, corresponds to the nature of the PSIS data: coverage is currently essentially complete in Atlantic Canada, which is not the case elsewhere in the country. As previously discussed in the literature review, such coverage allows for a much more complete, and more representative, analysis of persistence patterns than is possible with data based only on a single institution – or even a collection of single institutions which do not actually take account of switching among themselves. Individual institutions also typically lack the sample sizes required to carry out a statistically credible and detailed analysis of the kind presented here. And finally, having the broader data set allows for the direct inter-provincial comparisons that have been carried out.

In summary, the PSIS affords the opportunity to analyse persistence in a more complete and effective fashion than has been possible before – in this case for all of Atlantic Canada.3

The Longitudinal PSIS, or "L-PSIS"

The basic "cross-sectional PSIS" includes one record for each program a person was in for each year of data (2001/2002 through 2004/2005), "year" in this case being a reporting year, and reflecting the organisation of the institutions' administrative data. These are essentially the raw data Statistics Canada receives from the institutions.

As discussed earlier in the section on the analytical framework, an individual will have multiple records in a given year if they were in more than one program in that year (in the same institution or at different institutions), while individuals will have records in more than one year in cases where they were in a given program beyond a single (reporting) year or were in different programs in different years.

The basic PSIS is thus comprised of a set of individual "person-year-program" records which are not linked together for given individuals – either across programs in a given year, or over time. For this project, however, individuals were linked in both these respects. In essence, every person-year-program record had an individual identifier attached.

It is this "L-PSIS" file (for "Longitudinal PSIS" to differentiate it from the underlying cross-sectional files, although this is not an official name that has been given to the file) with its personal identifiers that allows us to match individuals' records across programs in a given year, and over time. With this information, we are able to arrange the data in the manner required to identify program starts and subsequent PSE dynamics and otherwise carry out the analysis in the manner discussed above.

The linkage process whereby the individual identifiers were attached to construct the "L-PSIS" is actually relatively simple, since PSIS records generally include enough information to identify individuals and match them across their different records.

Eighty-five percent of the linkages were "deterministic", defined as cases where the matches were made within the same institution based on institution code, student number, SIN, birth date, name and gender.4

The remaining 15 percent of the linkages were "probabilistic", based on the "GRLS" (Generalised Record Linkage System) employed by Statistics Canada for datasets requiring such matching. In this system, weights are assigned to each variable used in the matching process (name, sex, birth year, etc.), different types of links are assigned based on that information, and thresholds are used to determine the final decision on a case-by-case basis.

By this method, when the total weight of a pair of records is greater than the upper threshold (i.e., most of the information corresponds), it is classified as a definite link. When the total weight is below the lower threshold (less of the information corresponds), it is classified as a rejected link. Cases with a total weight between the two thresholds are considered possible links, which require manual resolution to determine whether they are accepted or rejected. Particular care was taken for cases possibly involving twins or mother/daughter and father/son pairs, where some of the basic information (e.g., names, birth dates) might be identical. All linked pairs were then checked for inconsistencies.
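The threshold logic described above can be sketched in a few lines of code. This is purely illustrative: the agreement weights and thresholds shown here are hypothetical, as the actual GRLS parameter values used by Statistics Canada are not given in the text.

```python
# Hypothetical agreement weights per matching variable (not the
# actual GRLS values, which are not reported in this study)
WEIGHTS = {"name": 4.0, "sex": 1.0, "birth_year": 3.0}
UPPER_THRESHOLD = 6.0   # above this: definite link
LOWER_THRESHOLD = 3.0   # below this: rejected link

def classify_pair(record_a, record_b):
    """Classify a candidate record pair as a definite, possible,
    or rejected link based on its total agreement weight."""
    total = sum(w for var, w in WEIGHTS.items()
                if record_a.get(var) == record_b.get(var))
    if total > UPPER_THRESHOLD:
        return "definite"
    if total < LOWER_THRESHOLD:
        return "rejected"
    return "possible"   # requires manual resolution

a = {"name": "J. Smith", "sex": "F", "birth_year": 1984}
b = {"name": "J. Smith", "sex": "F", "birth_year": 1984}
c = {"name": "K. Jones", "sex": "F", "birth_year": 1990}
```

Here the pair (a, b) agrees on every variable and so exceeds the upper threshold, while (a, c) agrees only on sex and falls below the lower threshold.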

Further tests were carried out to see if other links may have been missed. This essentially consisted of relaxing the thresholds and checking any additional matches made with the lower thresholds. Very few additional matches were, however, made in this way, suggesting that the methods employed were indeed picking up most of the actual "correct" linkages in the data. This is a particularly important step for our analysis because the absence of a record in a given year for a given individual is interpreted as indicating the person was not in PSE in that year, which is our principal means of defining PSE leavers.

In addition, Statistics Canada further investigated records for those we identified as PSE leavers in our analysis based on the lack of any record in a subsequent period (thus implying the person was not in school) to again see if any record matches may have been missed, but again there was no evidence that this was the case, giving us further confidence in the data.

Statistics Canada generally regards the record linkage exercise to have been highly successful, and we have no reason to doubt this assessment based on our own work with the data as well as meetings with Statistics Canada personnel who explained the record linkage procedures and showed us various computer programs and data files related to that process.

4.2   Sample selection criteria

General Selection Rules

To begin, we selected into our working samples only those records which indicated the individual started a new PSE program over the 2001/2002 through 2004/2005 period covered by the data. The reasons for restricting the analysis to individuals starting new programs are two-fold. First, it is well known based on other research (and confirmed in our analysis here) that persistence rates vary with the duration of a spell, or otherwise put, depending on what year of studies the person is in (although "year of studies" is itself often difficult to define).

Thus, if we do not take spell year into account, we will obtain a set of average transition rates that are not necessarily very meaningful: how many individuals who are in PSE in a given year then graduate, continue, switch, or leave in the next year regardless of what year they are in. Identifying individuals at the beginning of their spells and following them on a year by year basis from that point therefore represents the desired set-up from an analytical perspective.
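The point about conditioning on spell year can be made concrete with a small sketch: transition rates are tabulated separately for each year since the program start rather than pooled across all enrolled students. The data values below are made up for illustration.

```python
from collections import Counter

# Hypothetical (spell_year, outcome) observations for a cohort
transitions = [
    (1, "continue"), (1, "leave"), (1, "continue"), (1, "switch"),
    (2, "continue"), (2, "graduate"), (2, "leave"),
]

def rates_by_spell_year(transitions):
    """Return {spell_year: {outcome: share of that year's spells}}."""
    totals = Counter(year for year, _ in transitions)
    counts = Counter(transitions)
    return {year: {outcome: counts[(y, outcome)] / totals[year]
                   for (y, outcome) in counts if y == year}
            for year in totals}
```

With these made-up data, half of first-year spells continue, while the graduation rate is zero in year one and positive only in year two, which is exactly the kind of duration dependence that pooled average rates would obscure.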

Secondly, by including only new PSE program beginners, we obtain a representative sample with well defined properties: those individuals who started a PSE program during the 2001/2002 through 2004/2005 period covered by the data.5

Sample 1

We also, however, put other restrictions on the data. First, we included only those individuals who started regular PSE programs, and excluded those taking individual courses at college or university that were not part of a program normally intended to lead to a completed diploma or degree. This is the usual definition of PSE used, for example, in calculating PSE enrolment numbers or in determining eligibility for student financial assistance.

The selected programs could, however, be at any of the following levels: college, bachelor's degree, master's degree, Ph.D., or a first professional degree. The analysis carried out with the college and bachelor's degree samples is, however, more extensive than that for the other levels. This is partly because the larger sample sizes make the college and undergraduate university level analysis more statistically reliable, and allow for a much more detailed analysis. But it is also because the dynamics at the college and undergraduate university levels are more varied, more interesting, and probably more relevant to policy-related questions, even if only for the sheer numbers involved.

Having stated these general principles, the nature of the PSIS data sometimes makes it difficult to identify when students actually started a new program as opposed to when they were continuing in a program that had started earlier, since in some cases "a new program" is only the beginning of another phase of what is essentially an on-going program (as discussed above).

Furthermore, the data on start dates and related variables can represent different things at different institutions, meaning that any attempt to differentiate new program starts from the continuation of a single program would require institution-specific treatment, and thus very complex programming. This might be a worthwhile exercise, but it is one that lay beyond the scope of this analysis.

Finally, in addition to such data problems, there probably remains an inherent ambiguity in the underlying reality of what constitutes the start of "a new program", especially in cases where a person has already been in school and is continuing their studies perhaps with a bit of a shift (e.g., a change of major or a move from one faculty to another).

The first general rule we adopted for our Sample 1, therefore, is that to be considered as a new program and thus included in the analysis, the information had to indicate that the program did indeed start in the year in question and there was no other program going on simultaneously when it started. In the case of such overlapping, it was felt there simply would have been too much uncertainty as to which program had started when and how the two programs were related, if at all.

Appendix 1 at the end of the paper shows some typical illustrative cases in point, and the sample inclusion rules that would be applied to these. Note that a program was selected into the sample in cases where there were two records in a given year if the second program started after the first one ended, and thus had a "clean" start.

In all cases – in the trivial cases where there was just one program in a given year as well in the more complicated cases where there were two programs – it was the first "clean" program in a given year (i.e., with a well defined start date at a point in time when there were no other programs in progress) that was selected into the sample.6
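The Sample 1 rule just described can be summarised as a simple filter: within each person-reporting-year group, keep the first program whose start date falls at a point when no other program is in progress. The record fields used here are illustrative stand-ins, not the actual PSIS variable names, and dates are simplified to comparable numbers.

```python
from collections import defaultdict

def clean_starts(records):
    """records: list of dicts with 'person', 'year', 'start', 'end'.
    Returns the first 'clean' program start per person-year, i.e.
    a start date at which no other program was in progress."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["person"], r["year"])].append(r)
    selected = []
    for progs in groups.values():
        if len(progs) > 2:
            continue  # three or more programs in a year were dropped
        progs.sort(key=lambda r: r["start"])
        for i, p in enumerate(progs):
            in_progress = any(q["start"] <= p["start"] <= q["end"]
                              for j, q in enumerate(progs) if j != i)
            if not in_progress:
                selected.append(p)  # first "clean" start in the year
                break
    return selected

records = [
    {"person": "A", "year": 2002, "start": 1, "end": 8},
    {"person": "B", "year": 2002, "start": 1, "end": 4},
    {"person": "B", "year": 2002, "start": 6, "end": 10},
]
```

For person B, the second program starts after the first one ends, so the first program is the clean start selected into the sample.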

Sample 2

A second set of conditions is added to create our preferred Sample 2. First, like Sample 1, the person must have had a "clean" program start in a given year as indicated by its start date and the information on any other concurrent programs, as described above. But we also require (unlike Sample 1) that the person was not enrolled in any other program earlier in the year regardless of the earlier program's start and end dates, nor were they in any program in the previous year. These conditions were imposed in order to further ensure that the program being selected was indeed a new start, and not the continuation of an earlier program with the potential complications discussed above.

Finally, we also restrict the sample to programs that started in August or September. While individuals do of course start programs at other times through the year, it was felt, based on our inspection of the data, that at least a substantial proportion of these might be individuals who were coming back to school after a previous start.7
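Taken together, the additional Sample 2 conditions layer three checks on top of a Sample 1-style clean start. The sketch below is a hypothetical rendering of those checks; the field names, the use of start month as a proxy for "earlier in the year", and the shape of the per-person history are all assumptions for illustration.

```python
def in_sample2(program, history):
    """program: a clean-start record with 'year' and 'start_month'.
    history: all of the person's person-year-program records."""
    # (1) no other program earlier in the same reporting year
    same_year_earlier = any(r["year"] == program["year"]
                            and r["start_month"] < program["start_month"]
                            and r is not program
                            for r in history)
    # (2) no program at all in the previous reporting year
    prior_year = any(r["year"] == program["year"] - 1 for r in history)
    # (3) the program must start in August or September
    fall_start = program["start_month"] in (8, 9)
    return not same_year_earlier and not prior_year and fall_start
```

A September start with no other records passes all three checks, while any record in the previous reporting year, or a start outside August/September, excludes the spell from Sample 2.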

This resulted in a set of very clean program starts which we think represents the best sample for this analysis, although for completeness we do present many findings for Sample 1 as well.8

Note that Sample 2 requires using the first year of data, 2001/2002, as a precondition (or "checking") year for spells beginning in 2002/2003. As a result, no new spells starting in that first year ("cohort 2001") enter this sample. (Otherwise put, there is no checking year for new programs starting in 2001/2002 since that is the first year of data).

In order to further tighten the analysis, we also restricted Sample 2 in certain places to individuals aged 17 to 20 at the beginning of their program. This was done in order to generate an even more well-defined, "clean" sample, meant to capture individuals just starting out in PSE directly or soon after completing their high school studies, and to leave the analysis as unconfounded as possible by any previous – but inconsistently identified – PSE experiences. It may also be that this group is the one in which many government policy makers are most interested. This younger group also lends itself to comparisons with the YITS data, some of which are shown below. But we appreciate that older students are also of interest to policy-makers and postsecondary institutions, and we therefore present at least some results for the other samples as well. The reader may thus decide which results they prefer with full information as to what the different sets of findings represent.

4.3   Sample characteristics

Table 1 shows the characteristics of those included in the three different samples: Sample 1, Sample 2 all ages, and Sample 2 with the 17 to 20 age restriction. The decrease in sample sizes from Sample 1 to Sample 2 represents two effects: the tighter restrictions being imposed and the elimination of all spells beginning in 2001/2002 (as described above).


Table 1 Sample characteristics


The large number of starts for cohort 2001/2002 relative to 2002/2003 and 2003/2004 is an indication of how inclusive Sample 1 is: we suspect it is picking up many on-going programs that are simply observed for the first time in this first year of the data. This pattern further reinforces our general preference for Sample 2 over Sample 1.

Also note the number of observations lost with the additional age restriction imposed on Sample 2, especially at the college level. This reinforces the importance of offering different perspectives on the processes in question as represented by the two variants of the sample.

The differences between the total sample size numbers in Table 1 and the combined sizes of the bachelor's and college groups represent the other educational levels that figure in the analysis: those who started master's, Ph.D., and first professional degree programs. But as previously mentioned these groups figure in a more limited fashion in our analysis. Their precise numbers are shown below.

Other sample characteristics are also shown in Table 1. The samples are, overall, decidedly more female than male. This imbalance is driven in particular by the bachelor level numbers, where women make up about 60 percent of the student population, whereas the proportions are similar but tilted in the opposite direction at the college level.

We see more details on the age distribution in the next panel of the table, where again the different spreads at the college and university levels are revealed. A full 27 percent of the college students in Sample 2 are above age 26, and another 21 percent are age 21 to 25, leaving the more restrictive sample to include just the 51.5 percent of new entrants under this age.

By province, we see that students at Nova Scotia institutions comprise 42 to 47 percent of the samples, these shares holding roughly equally at the college and university levels. New Brunswick is in the 27 percent range at the college level in Sample 2 (more in Sample 1), but higher, at 34 to 38 percent, at the bachelor's level. For Newfoundland and Labrador, the opposite pattern holds: a relatively greater representation at the college level than university (17 to 20 percent and 11 to 12 percent, respectively, in the two versions of Sample 2). Prince Edward Island comprises something under 10 percent of each of the samples, with greater shares in Sample 2 than in Sample 1, which would be consistent with their students in the PSIS data generally representing more "fresh starts" than in the other jurisdictions.9 We note in this regard that the province indicated here is that of the PSE institution attended, not the origin of the student. The different relative sizes of the college and bachelor's sectors by province are interesting and themselves worthy of study, but for this study we simply take those patterns as given.


Notes

  1. Having to put their data into this standardised format can represent a considerable burden for institutions, but the result is a dataset that has consistent information across institutions. The benefits of this consistency were abundantly apparent to the researchers while working on this project.
  2. This cooperation has been fostered by two important intergovernmental institutions, CAMET (the Council of Atlantic Ministers of Education and Training) and MPHEC (the Maritime Provinces Higher Education Commission).
  3. The equally recent and unique opportunities for studying persistence using the Statistics Canada's YITS datasets have been discussed above.
  4. Such identifying information is then stripped from the files that are created for the analysis for reasons of confidentiality.
  5. For both these reasons, including only new spells is the standard approach in hazard analysis.
  6. The very few cases (about 0.7 percent) where the person had three programs or more in a given reporting year were deleted.
  7. Seventy-three point four percent of all programs started in September or August, 9.5 percent in July, 5.3 percent in January, and a scattering in other months.
  8. These rules were arrived at after conducting extensive checks of individual micro records and carrying out many sets of cross-tabulations. We are grateful to individuals in the Centre for Education Statistics at Statistics Canada for their assistance in these exercises, and to college and university representatives who explained their institutions' files during earlier presentations of this work.
  9. This makes sense in a context where the limited programs offered in Prince Edward Island force some students to go out of province as they move through their studies.