Canadian Survey on Disability, 2017: Concepts and Methods Guide
6. Weighting and creation of final data files

6.1 Weighting

In a sample survey, each respondent represents not only himself or herself but also other people who have not been sampled. For that reason, each respondent is assigned a weight which indicates the number of people that he or she represents in the population. To maintain data coherence and ensure that the results accurately represent the target population and not just the individuals sampled, that weight must be used to compute all estimates.

There are several steps in calculating the weights for the Canadian Survey on Disability (CSD). The first step is to assign each unit selected for the CSD an initial weight based on the sample design. The initial weight is the inverse of the inclusion probability. A number of adjustments are then made to the weights to control for exclusions during collection, for non-response and to avoid extreme weights in the estimation domains. The final step involves calibrating the survey weights on the census estimated totals and making some adjustments to account for units that were in scope during the May 2016 selection process but out of scope at the time of the survey in 2017. The main steps in the weighting process are described in the subsections below.

Calculation of the initial weights

Initial weights need to be calculated for both the YES and NO samples.

Since the sample design of the YES sample (the CSD sample) is based on the design of the long-form census questionnaire (distributed to a sample of the population), the initial weight is the product of the weight of the YES frame and the inverse of the CSD sampling fraction. The CSD sampling fraction is the size of the sample selected in a stratum divided by the number of units available in the sampling frame in that stratum.

Meanwhile, the initial weight of the NO sample is determined by multiplying the final weight of the long-form census questionnaire by the inverse of the NO sample’s sampling fraction. For more information on the weighting strategy for the long-form census questionnaire, see Chapter 9 of the Guide to the Census of Population.
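As a minimal sketch of the calculation described above, the initial weight can be written as a frame (or census) weight divided by the stratum sampling fraction. The function name, argument names and the numbers in the example are hypothetical and are not taken from the CSD files.

```python
def initial_weight(frame_weight: float, n_selected: int, n_available: int) -> float:
    """Initial weight = frame weight x inverse of the stratum sampling fraction.

    frame_weight: weight carried over from the frame (hypothetical input).
    n_selected:   number of units selected in the stratum.
    n_available:  number of units available on the frame in the stratum.
    """
    if n_selected <= 0 or n_available < n_selected:
        raise ValueError("need 0 < n_selected <= n_available")
    sampling_fraction = n_selected / n_available
    return frame_weight / sampling_fraction


# Hypothetical stratum: 50 of 200 frame units selected, frame weight of 4.0.
example_weight = initial_weight(4.0, 50, 200)
```

In this made-up stratum the sampling fraction is 0.25, so each selected unit carries four times its frame weight.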

The weight adjustments described in the next subsections only pertain to the YES sample, as there was no collection done for the NO sample; and therefore no need for any adjustment other than the final calibration.

Adjustment for units not sent to collection

The sample selected for the CSD was expanded slightly in anticipation of the exclusion of some units from collection for the following reasons:

• selection of more than three members of the same household;
• no telephone number available to contact the selected person;
• no name and no date of birth reported on the census for a person, so no way to identify the right respondent in the household;
• selection of persons in households previously selected for the Aboriginal Peoples Survey when the total number of interviews for both surveys is four or more (we do not want to interview more than three people per household).

These losses were taken into account in calculating the sample size, and oversampling in some strata was done to compensate for them. Units excluded from collection were thus treated as non-respondents and weights of units sent to collection were adjusted.

To do this, the sample was divided into two groups: units sent to collection and units not sent to collection. A logistic regression model was used to model the probability of being sent to collection using variables from the census frame, as they were available for all the sampled units. The following variables were used for this model: Aboriginal group; census collection mode; low-income indicator for the household; size of household; marital status; stratum (combination of the type of region and degree of severity); number of bedrooms in dwelling; population centre indicator; consent to release census data in 92 years; census family type; household income; language spoken at home; dwelling type; home ownership versus rental indicator; and age and sex of the selected person. Interactions between some of these variables were also taken into account.

Using this model, we obtained the probability of being sent to collection for each sample unit. Homogeneous response classes were then formed by combining units with a similar probability of being sent to collection. An automatic class formation method was used to generate classes that were homogeneous with respect to predicted probability and that comprised a sufficient number of units sent to collection to avoid excessively large weight adjustment factors. A total of 42 classes were formed, and within each, the weight of the units not sent to collection was redistributed to the units sent to collection.
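The class-based redistribution can be illustrated with a short sketch. It assumes the predicted probabilities are already available from a fitted model, and it uses equal-count grouping of the sorted probabilities as a stand-in for the automatic class formation method, which this guide does not detail. All names and values are hypothetical.

```python
from collections import defaultdict


def form_classes(probabilities, n_classes):
    """Group units into equal-count classes of similar predicted probability.

    Assumption: this simple grouping replaces the survey's automatic method.
    """
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i])
    classes = [0] * len(probabilities)
    for rank, i in enumerate(order):
        classes[i] = rank * n_classes // len(probabilities)
    return classes


def redistribute(weights, classes, kept):
    """Within each class, move the weight of dropped units onto kept units."""
    total = defaultdict(float)
    kept_total = defaultdict(float)
    for w, c, k in zip(weights, classes, kept):
        total[c] += w
        if k:
            kept_total[c] += w
    for c in total:
        if kept_total[c] == 0:
            raise ValueError(f"class {c} has no kept units to receive weight")
    # Kept units are scaled by (class total / kept total); dropped units get 0.
    return [w * total[c] / kept_total[c] if k else 0.0
            for w, c, k in zip(weights, classes, kept)]


# Hypothetical data: six units, two classes.
probs = [0.9, 0.2, 0.8, 0.3, 0.85, 0.25]
sent = [True, False, True, True, False, True]
weights = [10.0] * 6
adjusted = redistribute(weights, form_classes(probs, 2), sent)
```

The total weight is preserved within each class, which is why classes need enough kept units: a class with very few of them would produce a very large adjustment factor.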

Adjustment for non-response

There were two major categories of non-response in the CSD: non-contact, and non-response after contact. These two types of non-response were treated separately, as they constituted two different phenomena. The factors that explained non-contact tended to be more related to household characteristics and the geographic mobility of persons, while the factors that explained non-response in a contacted household tended to be more related to the individual’s characteristics.

First, units sent to the field were separated into two large groups: units that were contacted, and units that were not contacted. Logistic regression was used to model the probability of being contacted. The explanatory variables for the model came from the census frame. The variables selected for the non-contact model were: owning or renting the dwelling, domain of estimation, reporting method on the census, census family structure, one-year geographic mobility indicator, highest level of schooling completed, dwelling condition, consenting to release one’s census data in 92 years, registered Indian status, indicator of an emotional, psychological or mental health condition, five-year geographic mobility indicator, total household income, number of people in the household, occupation, language spoken at home, indicator of other health problem or long-term condition, structural type of dwelling, age group, indicator of difficulty seeing, total personal income, main mode of commuting, the number of children of respondent and respondent’s sex.

With this model, for each unit (contacted or not), a probability of being contacted was calculated. Then, response homogeneity classes were formed by grouping units that had similar probabilities of being contacted. An automatic class-formation method was used to generate classes that were homogeneous with respect to predicted probability of being contacted and contained a sufficient number of units contacted to avoid excessively large weight adjustment factors. A total of 70 classes were formed, and within each one, the weight of the non-contact units was redistributed to the contacted units.

Next, an adjustment was made for a subset of contacted persons who had a disability or health condition that prevented them from responding, or who completed the Disability Screening Questions (DSQ) module (and, based on their responses, had a disability) but not the rest of the CSD interview and hence could not be considered as respondents. Since there were very few of these cases (about 200), a relatively simple adjustment was made at the stratum level, redistributing the weight of non-respondents with a disability among respondents with a disability.

The next step was the adjustment of the weights for other non-respondents (generally refusals). In this case as well, logistic regression was used to model the probability of responding given the fact that contact had been made at the household level. The variables selected for the non-response with contact model were: domain of estimation (combination of province and age group), strata (combination of type of area and level of severity), marital status, consenting to release one’s census data in 92 years, household living arrangements of person, indicator of an emotional, psychological or mental health condition, indicator of other health problem or long-term condition, indicator of difficulty seeing, visible minority indicator, highest level of schooling completed, total household income, living in a large urban centre or not, number of bedrooms in the dwelling, owning or renting the dwelling, number of household maintainers, NAICS sector, knowledge of official languages, first official language spoken, Inuit mother tongue, place of residence one year ago, and main mode of commuting.

With this non-response model, for each unit (respondent or not), a probability of responding given contact was obtained. Response homogeneity classes were then formed by grouping units that had similar probabilities of responding. The same procedure was used as for the contact model, which resulted in the formation of 52 classes. Within each class, the weight of the non-respondent units was redistributed among the respondent units.

It should be noted that out-of-scope units (deaths, institutional admissions, persons who now live outside the country, etc.) were initially considered to be respondent units, in that we were able to speak with a household member who confirmed the unit’s out-of-scope status. Their weight was not set to 0; rather, it was retained because they represented units of the initial population (on May 10, 2016) that were out of scope in the spring of 2017. However, these units are excluded from the analytical file.

Adjustment for extreme weights by province

Following the non-contact and non-response adjustments, the distribution of respondents’ weights was examined to detect the presence of very large weights by province or by estimation domain. Some adjustment factors may have generated very large weights for some individuals compared with others in some domains, which could have a detrimental effect on the estimates and their variance. The sigma-gap method was used to detect these extreme weights first within each province. An example of how the sigma-gap method can be applied is given in Bernier and Nobrega (1998). As used here, the sigma-gap method is intended to detect large gaps between successive weights sorted in ascending order (when they are greater than the median). When an excessively large gap is found between two successive weights, the larger of the two weights and all subsequent weights are classified as outliers. To assess the size of a gap between two weights, it was compared with a certain number of standard deviations of the distribution of all weights. For the CSD, gaps between weights that were two times the distribution’s standard deviation within each province were identified. The choice of two standard deviations was made because it matched the gap that would have been used to identify outlier weights had we used a manual process. All the weights identified as outliers were set to the province’s highest non-outlier value. In total, weights were decreased for 11 units. The resulting weight reduction from this step will be offset at the calibration step.
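The sigma-gap rule described above can be sketched as follows for one province's weights. This is an illustration under stated assumptions, not the production procedure: the population standard deviation is used (the guide does not say which estimator was applied), a gap is tested only when its lower weight is at or above the median, and the function name and data are hypothetical.

```python
import statistics


def sigma_gap_trim(weights, k=2.0):
    """Cap outlier weights at the highest non-outlier value.

    A gap between two successive sorted weights above the median is flagged
    when it exceeds k standard deviations of all the weights.
    """
    ordered = sorted(weights)
    median = statistics.median(ordered)
    sd = statistics.pstdev(ordered)  # assumption: population standard deviation
    cap = None
    for lower, upper in zip(ordered, ordered[1:]):
        if lower >= median and (upper - lower) > k * sd:
            cap = lower  # highest non-outlier weight
            break
    if cap is None:
        return list(weights)  # no excessive gap found
    return [min(w, cap) for w in weights]


# Hypothetical weights: the last one sits far above the rest.
trimmed = sigma_gap_trim([10, 11, 12, 13, 100])
```

Here the gap of 87 between 13 and 100 exceeds twice the standard deviation (about 70.8), so the weight of 100 is set to 13, the highest non-outlier value.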

Before identifying extreme weights in estimation domains, estimation domain jumpers were examined.

Estimation domain jumpers and extreme weights by domain

CSD estimation domains are formed by cross-classifying the province and age group. The age used for this purpose was taken from the Census Response Database. In some cases, the age reported on the census is incorrect, either because the person who completed the census questionnaire for the household made a mistake or because of a data entry error or an issue with the optical reader used for paper questionnaires. In some cases as well, no birth date or age were reported on the Census, so an approximate age had to be imputed in the survey frame. However, since all respondents are asked their age at the beginning of the CSD interview, it is possible to assign them to their proper estimation domain. Consequently, 183 CSD respondents changed estimation domains. In such cases, the weight was compared with the range of weights for their new domain. When the individual’s weight fell within the range of weights in the new domain, it was retained with no change. On the other hand, if it fell outside the range of weights in the new domain, it was changed to the new domain’s minimum value (if it was below the range) or maximum value (if it was above the range). In this step, we adjusted the weight of 34 individuals in the CSD sample.
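The rule for domain jumpers amounts to clamping a weight to the range observed in the new domain. A minimal sketch, assuming the range is taken from the weights of the units already in that domain (names and values are hypothetical):

```python
def clamp_to_domain_range(weight, domain_weights):
    """Keep the weight if it lies within the new domain's range; otherwise
    set it to the domain's minimum or maximum weight."""
    lowest, highest = min(domain_weights), max(domain_weights)
    return max(lowest, min(weight, highest))


# Hypothetical weights of units already in the new domain.
new_domain = [100.0, 200.0, 300.0]
unchanged = clamp_to_domain_range(150.0, new_domain)
raised = clamp_to_domain_range(50.0, new_domain)
lowered = clamp_to_domain_range(500.0, new_domain)
```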

Next, the sigma-gap method was used once again, this time using the final estimation domain, and comparing the gap between two successive weights (above the median) in relation to twice the standard deviation of the distribution of weights in the domain. At this point, the weights of seven units were reduced. Weight reductions made at this step will be offset in the post-stratification step.

Post-stratification

Post-stratification was performed separately for the YES and NO samples. The weights of the NO sample were then adjusted to reflect certain losses observed in the YES sample between the census and CSD collection, which could not be observed in the NO sample as there is no collection for this sample. The losses observed are mainly due to deaths, institutionalizations and emigration. These steps are described in the following sections.

Post-stratification for the YES sample involved adjusting the weights of CSD respondents (out-of-scope cases, respondents with a disability, and respondents without a disability) in order to obtain the same weighted totals as the Census of Population (long-form questionnaire, excluding First Nations reserves and people under 15 years of age) for the YES population by province, age group, sex and severity. The term “severity” refers to the three levels of severity used to stratify the CSD based on responses to the six filter questions on Activities of Daily Living. Post-stratification across the 10 provinces was done using the following age groups: 15 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64, 65 to 74, and 75 and over. It was decided to use 10-year age groups rather than 5-year groups to calibrate all three severity levels within each group. In the territories, three age groups were used for post-stratification: 15 to 44, 45 to 64, and 65 and over. In addition, due to the small sample sizes in Nunavut, all three levels of severity were combined for respondents in the 65+ age group.

A similar post-stratification was performed on the preliminary weights of the NO sample using the initial weights calculated earlier. Post-stratification for this sample was done by province, five-year age groups, and sex. As the severity is null for the entire NO population, there is no need to post-stratify based on this variable. Post-stratification was done for the following age groups in the 10 provinces and 3 territories: 15 to 19, 20 to 24, 25 to 29, 30 to 34, 35 to 39, 40 to 44, 45 to 49, 50 to 54, 55 to 59, 60 to 64, 65 to 69, 70 to 74, and 75 and over.
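Post-stratification of this kind scales the weights in each cell so that they sum to a known control total. The sketch below assumes each unit carries a cell label (for example province, age group and sex) and that control totals are supplied per cell. The labels and totals are made up for illustration.

```python
from collections import defaultdict


def post_stratify(weights, cells, control_totals):
    """Scale weights so each cell's weighted total matches its control total."""
    weighted = defaultdict(float)
    for w, c in zip(weights, cells):
        weighted[c] += w
    return [w * control_totals[c] / weighted[c] for w, c in zip(weights, cells)]


# Hypothetical cells and control totals.
cell_a = ("ON", "15 to 24", "F")
cell_b = ("ON", "25 to 34", "M")
pre_weights = [10.0, 30.0, 20.0, 20.0]
unit_cells = [cell_a, cell_a, cell_b, cell_b]
controls = {cell_a: 50.0, cell_b: 30.0}
calibrated = post_stratify(pre_weights, unit_cells, controls)
```

Because each cell is rescaled to its control total, weight removed in earlier trimming steps is restored at the cell level, which is how the earlier reductions are offset.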

During collection of the CSD, some out-of-scope cases were found among the selected respondents. The weight associated with these out-of-scope cases was used to estimate the number of people in the YES population who became out of scope between Census Day and CSD collection. CSD collection occurred between 11 and 16 months after the census, and just over 196,000 out-of-scope cases are estimated in the YES population, or 2% of this population. Estimates for the various types of out-of-scope cases in the YES population are presented in the table below.

Table 6.1
Estimated number of out-of-scope cases among the YES population

| Type of out-of-scope case | Unweighted (number) | Weighted (number) | Weighted (percent) |
|---|---|---|---|
| Deaths | 703 | 139,810 | 71.2 |
| Institutional admissions | 318 | 47,920 | 24.4 |
| Emigrants | 24 | 6,100 | 3.1 |
| Persons less than 15 years old | 13 | 1,790 | 0.9 |
| Other | 7 | 710 | 0.4 |
| Total | 1,065 | 196,330 | 100.0 |

The three most common types of out-of-scope cases in the YES population are deaths (71%), institutional admissions (24%) and emigrants (3%). There were very few other types of out-of-scope cases.

Since out-of-scope cases are excluded from the analytical file—and therefore from the disability rates—it is important to also try to exclude them from the rate denominator (which includes both the YES and NO populations) to avoid underestimating disability rates. Since the NO sample was not sent to collection, out-of-scope cases cannot easily be identified. Furthermore, it cannot be assumed that the proportion of out-of-scope cases in the NO population is the same as it is in the YES population, specifically as regards death and institutionalization. Therefore, an indirect method was used to estimate and exclude out-of-scope cases in the NO population. It is not possible to correct for all types of losses in the NO sample because there are often no reliable data to do so; however, we tried to make corrections where possible.

We asked the Demography Division to provide the attrition rates of the population aged 15 and over from May 10, 2016 to midway through the CSD collection period (June 1st, 2017). These rates are calculated for the entire population, including collective dwellings and institutions, but excluding First Nations reserves. They were then applied to the population covered by the 2016 long-form Census questionnaire (which excludes collective dwellings and institutions, as well as First Nations reserves) to estimate the total number of losses due to death and emigration for this population. Seeing as these losses cover the entire population (i.e., both the YES and NO populations), the estimates for deaths and emigration derived from the YES population can be subtracted to obtain an estimate of the losses due to death and emigration for the NO population. The weight of the NO sample is then adjusted downward to reflect these losses. This adjustment is made by province/territory, age group and sex.
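The indirect adjustment described above can be sketched for a single cell (province or territory, age group and sex). Total losses are estimated from the attrition rate, the losses already estimated in the YES population are subtracted, and the remainder is removed from the NO population. All names and numbers are hypothetical, and the rate is assumed to cover deaths and emigration only.

```python
def no_sample_loss_factor(attrition_rate, yes_population, no_population, yes_losses):
    """Factor applied to NO-sample weights in one adjustment cell.

    attrition_rate: share of the population lost to death and emigration.
    yes_losses:     losses already estimated from the YES sample.
    """
    total_losses = attrition_rate * (yes_population + no_population)
    no_losses = total_losses - yes_losses
    return (no_population - no_losses) / no_population


# Hypothetical cell: 1% attrition, 1,000 YES and 3,000 NO persons, 25 YES losses.
factor = no_sample_loss_factor(0.01, 1000.0, 3000.0, 25.0)
```

In this made-up cell, 40 losses are expected overall, 25 of them are already accounted for in the YES population, and the remaining 15 reduce the NO weights by a factor of 0.995.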

This method slightly overestimates losses because the attrition rates are calculated for a population that includes people who were living in an institution at the time of the census. However, the fact that it is not possible to correct the NO population for losses due to institutionalization somewhat offsets the overestimation. It should be noted that some of the deaths that occurred in institutions may have involved people who had been living in a private household at the time of the census, then were institutionalized and eventually died. Consequently, part of the overcorrection for deaths offsets the lack of corrections made for institutionalizations.

In the YES population, the number of deaths and emigrants between May 10, 2016 and June 1, 2017, is estimated to be roughly 146,000, or 1.5% of the YES population in the census. Adjustments for deaths and emigrants in the NO population reduced it by approximately 168,000 people, or 0.9% of the NO population in the census.

Table 6.2 provides the population counts for the YES and NO populations before and after the exclusion of out-of-scope cases.

Table 6.2
Weighted count for the YES and NO populations

| Population | Weighted count before excluding out-of-scope cases (number) | Weighted count after excluding out-of-scope cases (number) |
|---|---|---|
| YES population | 10,016,500 | 9,820,170 |
| NO population | 18,360,000 | 18,188,690 |
| Total | 28,376,500 | 28,008,860 |

6.2 File structure and content

Two analytical data files were created for CSD data: a file for persons who have a disability, and a file for persons who do not have a disability. Depending on the type of analysis required, researchers will use either the file on persons with a disability only or both files together.

The file on persons with a disability contains data on those persons selected for the CSD who, according to the definition of disability used in the CSD, are considered to have a disability. This file is the more comprehensive of the two. It contains all CSD data and many variables linked from the census. Any analysis that deals exclusively with persons with a disability can be done with this file alone.

The file on persons without a disability contains data on two groups of people: one from the CSD’s YES sample, and one from the NO sample. The two groups are as follows:

• (a) False positives from the YES sample: Persons interviewed for the CSD who, upon completion of the Disability Screening Questions (DSQ), were not identified as having a disability (false positives). These respondents either reported that they were “never” limited in their day-to-day activities because of their condition or they reported being limited only “rarely” with “no difficulty” or “some difficulty” in performing certain tasks. All of these respondents were deemed not to have a disability and therefore did not have to complete the rest of the questions in the CSD.
• (b) Persons from the NO sample: Persons from the NO sample are those who reported no difficulties or long-term conditions on any of the 2016 Census Activities of Daily Living filter questions. This group was not sent to CSD collection for disability identification: as a result of their responses to the census filter questions, these persons were automatically deemed not to have a disability.

Hence, the file on persons without a disability has different content depending on the group of people involved. For the persons in group (a), the false positives, only the data from the CSD’s DSQ module is captured, since the interview was terminated immediately after that module. However, the census variables are also available for this group. For the persons in group (b), from the NO sample, only the census variables are captured, since no CSD collection was done for those units.

The file on persons without a disability should be used together with the file on persons with a disability for two types of analysis: 1) calculation of disability rates, since the denominator must include both persons with a disability and persons without a disability, and 2) comparison of the census characteristics of persons with a disability and persons without a disability.

To distinguish between the two groups in the analytical files, a derived variable was created, CSDPOPFL, which takes a value of 1 for persons with a disability, 2 for group (a) persons without a disability (false positives), and 3 for group (b) persons without a disability (NO sample).
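As an illustration of how CSDPOPFL and the person-weights combine in a disability rate, the sketch below takes records pooled from both analytical files. The record layout and the values are hypothetical. Only the CSDPOPFL codes (1, 2, 3) come from this guide.

```python
def disability_rate(records):
    """Weighted disability rate from (csdpopfl, person_weight) pairs.

    Numerator:   weighted count of persons with a disability (CSDPOPFL = 1).
    Denominator: weighted count of all three groups (CSDPOPFL = 1, 2 or 3).
    """
    records = list(records)
    with_disability = sum(w for flag, w in records if flag == 1)
    everyone = sum(w for flag, w in records if flag in (1, 2, 3))
    return with_disability / everyone


# Hypothetical pooled records from the two files.
pooled = [(1, 100.0), (1, 50.0), (2, 50.0), (3, 300.0)]
rate = disability_rate(pooled)
```

Leaving out either group of persons without a disability would shrink the denominator and overstate the rate, which is why both files are needed for this calculation.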

The table below summarizes the contents of the two data files for each of the population groups. As shown, the two files will have different sets of variables. The analytical file on persons WITH a disability will have a complete set of variables. For the analytical file on persons WITHOUT a disability, some of the variables will be missing from the file. Missing variables will be slightly different for each of the two population groups on that file. As a result, when using the analytical file on persons WITH a disability together with the file on persons WITHOUT a disability, some variables will show missing values for persons WITHOUT a disability.

Table 6.3
Available content for various groups of persons in the Canadian Survey on Disability analytical files

| Analytical data file | Population group | CSDPOPFL | Demographic variables | DSQ | CSD thematic content | Census variables | Final person-weight |
|---|---|---|---|---|---|---|---|
| File on persons WITH a disability | Persons WITH a disability | 1 | Available | Available | Available | Available | Available |
| File on persons WITHOUT a disability | Persons WITHOUT a disability (group (a)) | 2 | Available | Available | Not available | Available | Available |
| File on persons WITHOUT a disability | Persons WITHOUT a disability (group (b)) | 3 | Available | Not available | Not available | Available | Available |

A note on reference periods

When calculating disability rates or comparing the characteristics of persons with a disability and persons without a disability, the reference date is May 10, 2016, Census Day. This is the date when the CSD sampling frame was defined and when comparative census indicators were collected for persons with and without disabilities. However, when researchers are only interested in persons with a disability, they will work with the CSD data collected and measured in the spring and summer of 2017 for the subset of persons with a disability. In this case, the reference period is from March 1 to August 30, 2017.

In other words, for the CSD, persons with a disability are individuals who reported having difficulty sometimes, often or always on the Activities of Daily Living question of the 2016 long-form Census, and who reported having a disability in the CSD in 2017. Hence, the CSD’s characteristics for persons with a disability are based on 2017 information about a population defined in 2016.

6.3 Final datasets and data dictionaries

Final data files included the following:

• Analytical files for use in Research Data Centres (RDCs) across Canada
• Data files for use by subscribers of the Real Time Remote Access (RTRA) system at Statistics Canada

The analytical files are dissemination files with enhanced protection of respondent confidentiality for release and distribution to RDCs across Canada. They are also used at Statistics Canada to produce data tables in response to client requests. Person-weights are available on the files. Weighting is described in more detail in Section 6.1. Any variables used exclusively for processing purposes or for internal research were removed from the analytical files.

Accompanying the 2017 CSD analytical files are the following supporting documents:

• The record layout
• SAS (Statistical Analysis System), SPSS (Statistical Package for the Social Sciences) and Stata syntax to load the files
• Metadata in the form of a data dictionary for each analytical file that describe every variable and provide weighted and unweighted frequency counts
• A user guide entitled CSD 2017: A User Guide to the Analytical Data Files, as described in Section 6.4
• This CSD 2017: Concepts and Methods Guide, as an essential companion document to the user guide

RTRA data files are housed at Statistics Canada for use by subscribers who can run statistical programs on the data from remote locations. These files consist of the analytical data files but have undergone further processing. All sub-provincial geographies have been removed, permitting analysis only at the national, provincial and territorial levels.

For RTRA users, data dictionaries are provided with full descriptions for all the variables but without any data frequencies, called the “zero-frequency” versions.

6.4 Guidelines for analysis

The User Guide created for the RDC analytical files provides detailed step-by-step instructions for using the 2017 CSD data files. It includes guidelines for tabulation and statistical analysis, how to apply the necessary weights to the data, information on software packages available and guidelines for the release of data, such as rounding rules. The process of calculating the reliability of estimates, both quantitative and qualitative, is covered in detail.

For RTRA users, confidentiality rules and reliability standards are applied to all tabulation requests in an automated way by the RTRA system.

The CSD User Guide is for use in combination with the Concepts and Methods Guide and the data dictionaries.
