Canadian Survey on Disability, 2022: Concepts and Methods Guide
6. Weighting and creation of final data files
6.1 Weighting
In a sample survey, each respondent does not only represent themselves, but also other people who have not been sampled. For that reason, each respondent is assigned a weight which can be interpreted as the number of people that they represent in the population. To maintain data coherence and ensure that the results accurately represent the target population and not just the individuals sampled, that weight must be used to compute all estimates.
There are several steps in calculating the weights for the Canadian Survey on Disability (CSD). The first step is to assign each unit selected for the CSD an initial weight based on the sample design. The initial weight is the inverse of the inclusion probability. A number of adjustments are then made to the weights to correct for non-response and to avoid extreme weights. The survey weights are then calibrated to totals estimated from the census long-form questionnaire sample. Finally, an additional step is needed to account for units that were in scope during the May 2021 selection process, but out of scope at the time of the survey in 2022.
The main steps of the CSD weighting process are described in the subsections below.
Calculation of the initial weights
For the YES and the NO samples, the first step of the weighting process is to calculate the initial weight. Since both samples are selected according to a two-phase sampling design, the initial weight is calculated by taking the product of the weight at each phase.
For the YES sample, also referred to as the CSD sample, the first-phase weight is the initial weight of the census long-form questionnaire adjusted for non-response and corrected to reflect the exclusion of units due to overlap with other surveys. It corresponds to the weight on the CSD sampling frame. The second-phase weight is the inverse of the inclusion probability in the CSD sample. For the main sample, this probability corresponds to the sampling fraction, namely the size of the sample selected in a stratum divided by the number of units available in the sampling frame in that stratum. For the oversample of Veterans, an empirical approach was taken to approximate the inclusion probability of the selected units.
For the NO sample, the initial weight is calculated by multiplying the final weight of the census long-form questionnaire by the inverse of the sampling fraction of the NO sample. For more information on the weighting strategy for the census long-form questionnaire, see Chapter 12 of the Guide to the Census of Population.
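As a rough illustration of the two-phase calculation described above, the sketch below multiplies a first-phase weight by the inverse of a stratum sampling fraction. The function name, argument names and figures are hypothetical and are not taken from the CSD processing system. It does not cover the empirical approach used for the Veterans oversample.

```python
def initial_weight(first_phase_weight, stratum_frame_count, stratum_sample_size):
    """Two-phase initial weight: first-phase weight times the inverse
    of the second-phase sampling fraction (illustrative only)."""
    if not 0 < stratum_sample_size <= stratum_frame_count:
        raise ValueError("sample size must be between 1 and the frame count")
    # Sampling fraction = units selected in the stratum / units on the frame.
    sampling_fraction = stratum_sample_size / stratum_frame_count
    second_phase_weight = 1.0 / sampling_fraction
    return first_phase_weight * second_phase_weight


# Hypothetical unit: frame weight of 4.0, 50 units sampled out of 200.
example_weight = initial_weight(4.0, 200, 50)
print(example_weight)  # 16.0
```

In this made-up case the unit represents 16 people: 4 from the first phase, multiplied by 4 because one frame unit in four was selected in its stratum.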
Non-response adjustments
The main purpose of any non-response adjustment is to minimize the impact of potential bias arising from non-response. For the adjustment to efficiently reduce the potential bias, a rich set of information about the non-respondents is very useful. Fortunately, the design of the CSD ensures that the information collected on the census long-form questionnaire is available for all units selected in the sample. That being said, in order to reduce the potential bias without increasing the variance, the variables used to correct for non-response should also be associated with the main survey variables. To that end, evaluations were undertaken to identify a subset of census variables that were associated with both the response probability and disability status.
Non-response to the CSD happened for different reasons. Some selected units were not sent to collection because of missing contact information or to lessen the response burden on households. For other units, no successful contact was made during collection. Even when contact was made, some units could not respond because of their disability. Finally, some units refused to respond or responded only partially. These sources of non-response were treated separately because they constituted different phenomena. The factors that explained the first two sources tended to be more related to household characteristics and the geographic mobility of persons, while the factors that explained the others tended to be more related to the individual’s characteristics.
In total, four non-response adjustments were carried out to compensate for the different sources of non-response:
- An adjustment for units not sent to collection,
- An adjustment for non-contact,
- An adjustment for non-respondents known to have a disability,
- And an adjustment for other non-respondents.
Each of these adjustments was conducted in the same fashion. First, the units to which the adjustment applies were identified. A logistic model was then fitted to estimate the response probability using variables from the census database that were related to both the response probability and disability. Homogeneous groups were then formed by combining units with a similar estimated response probability using an automatic class formation method. The groups were formed so that they comprised enough respondent units to avoid unduly large weight adjustment factors. Within each group, the weight of the non-respondent units was redistributed to the respondent units.
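The redistribution step within response groups can be sketched as follows. This is a minimal illustration that assumes the groups have already been formed; it does not fit the logistic model or reproduce the class formation method. The record layout (`weight`, `group`, `responded`) is hypothetical.

```python
from collections import defaultdict


def redistribute_nonresponse(units):
    """Return adjusted weights: within each group, respondents absorb the
    weight of non-respondents, whose adjusted weight becomes 0.0."""
    total_weight = defaultdict(float)       # all units in the group
    respondent_weight = defaultdict(float)  # respondents only
    for unit in units:
        total_weight[unit["group"]] += unit["weight"]
        if unit["responded"]:
            respondent_weight[unit["group"]] += unit["weight"]

    adjusted = []
    for unit in units:
        if unit["responded"]:
            # Adjustment factor = group total / respondent total (>= 1).
            factor = total_weight[unit["group"]] / respondent_weight[unit["group"]]
            adjusted.append(unit["weight"] * factor)
        else:
            adjusted.append(0.0)
    return adjusted


# Hypothetical data: two response groups, one non-respondent in group "A".
sample = [
    {"weight": 10.0, "group": "A", "responded": True},
    {"weight": 10.0, "group": "A", "responded": True},
    {"weight": 20.0, "group": "A", "responded": False},
    {"weight": 5.0, "group": "B", "responded": True},
    {"weight": 15.0, "group": "B", "responded": True},
]
adjusted_weights = redistribute_nonresponse(sample)
print(adjusted_weights)  # [20.0, 20.0, 0.0, 5.0, 15.0]
```

The total weight is preserved (60 before and after). Groups with few respondents produce large factors, which is why the groups are formed so that each one contains enough respondent units.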
The following subsections provide more details about the four non-response adjustments. These only pertain to the YES sample, as there was no collection done for the NO sample and therefore, no non-response.
Adjustment for units not sent to collection
The sample selected for the CSD was expanded slightly in anticipation of the exclusion of some units because of missing contact information or to lessen the response burden. More specifically, some selected units were not sent to collection for the following reasons:
- no telephone number available to contact the selected person,
- no name and no date of birth reported on the census for a person, so no way to identify the right respondent in the household,
- selection of more than three members of the same household,
- selection of persons already selected for the Survey on the Official Language Minority Population (SOLMP),
- selection of persons from households in which persons were selected for the Indigenous Peoples Survey (IPS) and/or the SOLMP, and for which the total number of interviews for the CSD, the IPS and the SOLMP would have been four or more.
These losses were considered in calculating the sample size, and oversampling in some strata was done to compensate for them. Units excluded from collection were thus treated as non-respondents and the weights of units sent to collection were adjusted.
For this adjustment, the sample was divided into two groups: units sent to collection and units not sent to collection. Logistic regression was used to model the probability of being sent to collection. The following variables were used for the logistic model: province; mother tongue; Indigenous identity; subsidized housing indicator; number of children in census family; first official language spoken; highest certificate, diploma or degree; stratum (combination of the type of region and degree of severity); household income; language spoken most often at home; membership in a First Nation or Indian band; age group; gender; dwelling value; dwelling condition; indicator of difficulty hearing; and main reason for not working the full year.
Using this model, the probability of being sent to collection was estimated for each sampled unit. A total of 37 homogeneous response groups were formed, and within each group, the weight of the units not sent to collection was redistributed to the units sent to collection.
Non-contact adjustment
To carry out the non-contact adjustment, the units that were sent to collection were separated into two groups: units that were successfully contacted, and units that were not contacted. Logistic regression was used to model the probability of being contacted. The variables selected for the non-contact model were: stratum (combination of the type of region and degree of severity); indicator of difficulty doing physical activities; domain (combination of province and age group); age group; gender; household living arrangements of person; economic family status of person; census family structure; indicator of emotional, psychological or mental health conditions; indicator of other health problems or long-term conditions; visible minority; highest certificate, diploma or degree; reporting method on the census; low-income status of the household; population centre indicator; number of bedrooms; tenure (owning or renting the dwelling); dwelling condition; full-time or part-time work; occupation; first official language spoken; one-year geographic mobility indicator; military service status; and place of work status.
Using this model, the probability of being contacted was estimated for each unit sent to collection. A total of 68 homogeneous response groups were formed, and within each group, the weight of the units that were not contacted was redistributed to the contacted units.
Non-response adjustment for units known to have a disability
Next, an adjustment was made to account for the subset of contacted persons who had a disability or health condition that prevented them from responding, or who completed the DSQ module (and, based on their responses, had a disability) but not the rest of the CSD interview and hence could not be considered as respondents. Because these non-respondent units are known to have a disability, their weight was redistributed among respondents who have a disability.
Logistic regression was used to model the probability of response for this subset of units. The variables selected for the model were: reporting method on the census; dwelling value; age group; mother tongue; low-income status of the household; occupation; dwelling condition; membership in a First Nation or Indian band; indicator that the property taxes are included in the mortgage payments; North American Industry Classification System (NAICS) sector; gender; marital status; indicator of difficulty learning, remembering or concentrating; presence of the person’s spouse or partner in the household; military service status; one-year geographic mobility indicator; indicator of difficulty hearing; first official language spoken; and visible minority.
Using this model, the response probability was estimated for each contacted unit known to have a disability. A total of 43 homogeneous response groups were formed, and within each group, the weight of the non-respondent units known to have a disability was redistributed to the respondent units who have a disability.
Non-response adjustment for other units
A final non-response adjustment was carried out to compensate for other non-respondents (generally refusals). Logistic regression was used to model their probability of response. The variables selected for the model were: domain (combination of province and age group); reporting method on the census; highest certificate, diploma or degree; age group; household living arrangement of person; indicator of other health problems or long-term conditions; place of work status; first official language spoken; low-income status of the household; occupation; presence of the person's spouse or partner in the household; tenure (owning or renting the dwelling); economic family status of person; enrollment under an Inuit land claims agreement; membership in a First Nation or Indian band; visible minority; indicator of difficulty doing physical activities; hours worked; place of birth; household income; dwelling condition; labour force status; one-year geographic mobility indicator; number of bedrooms; and full-time or part-time work.
Using this model, the response probability was estimated for each contacted unit, except for non-respondent units known to have a disability. A total of 86 homogeneous response groups were formed, and within each group, the weight of the non-respondent units for which the disability status is not known was redistributed to the respondent units.
Adjustment for extreme weights by province
Following the four non-response adjustments, the distribution of respondents’ weights was examined to detect the presence of very large weights by province or by estimation domain. Some adjustment factors may have generated very large weights for some individuals compared with others in some domains, which could have a detrimental effect on the estimates and their variance. The sigma-gap method was used to detect these extreme weights, first within each province. An example of how the sigma-gap method can be applied is given in Bernier and Nobrega (1998). As used here, the sigma-gap method is intended to detect large gaps between successive weights sorted in ascending order (when they are greater than the median). When an excessively large gap is found between two successive weights, the larger of the two weights and all subsequent weights are classified as outliers. To assess the size of a gap between two weights, it was compared with a certain number of standard deviations of the distribution of all weights. For the CSD, gaps between weights that exceeded two times the distribution’s standard deviation within each province were identified. The choice of two standard deviations was made because it matched the gap that would have been used to identify outlier weights had a manual process been used. All the weights identified as outliers were set to the province’s highest non-outlier value. In total, weights were decreased for 13 units. The resulting weight reduction from this step will be offset at the calibration step.
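The sigma-gap rule described above can be sketched for a single province as follows. This is an illustrative reading of the text, not the production implementation: the use of the population standard deviation, the function name and the example weights are all assumptions.

```python
import statistics


def sigma_gap_cap(weights, n_sigma=2.0):
    """Cap outlier weights found by a sigma-gap rule within one group.

    Weights are sorted in ascending order; above the median, the first gap
    between successive weights that exceeds n_sigma standard deviations marks
    the larger weight and all subsequent weights as outliers. Outliers are
    set to the highest non-outlier weight.
    """
    if len(weights) < 2:
        return list(weights)
    ordered = sorted(weights)
    median = statistics.median(ordered)
    # Assumption: population standard deviation of all weights in the group.
    threshold = n_sigma * statistics.pstdev(ordered)

    cap = None
    for lower, upper in zip(ordered, ordered[1:]):
        if upper > median and (upper - lower) > threshold:
            cap = lower  # highest non-outlier weight
            break
    if cap is None:
        return list(weights)
    return [min(w, cap) for w in weights]


# Hypothetical weights for one province, with one extreme value.
province_weights = [10, 12, 11, 13, 12, 14, 11, 13, 12, 100]
capped = sigma_gap_cap(province_weights)
print(capped)  # [10, 12, 11, 13, 12, 14, 11, 13, 12, 14]
```

In this made-up example the standard deviation is about 26.4, so the threshold is about 52.8. The gap between 14 and 100 exceeds it, and the weight of 100 is reduced to 14.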
Before identifying extreme weights in estimation domains, estimation domain jumpers were examined.
Estimation domain jumpers and extreme weights by domain
CSD estimation domains are formed by cross-classifying the province and age group. The age used for this purpose was taken from the Census Response Database. In some cases, the age reported on the census is incorrect, either because the person who completed the census questionnaire for the household made a mistake or because of a data entry error or an issue with the optical reader used for paper questionnaires. In other cases, no birth date or age was reported on the census, so an approximate age had to be imputed in the survey frame. However, since all respondents are asked their age at the beginning of the CSD interview, it is possible to assign them to their proper estimation domain. Consequently, 132 CSD respondents changed estimation domains. In such cases, the weight was compared with the range of weights for their new domain. When the individual’s weight fell within the range of weights in the new domain, it was retained with no change. On the other hand, if it fell outside the range of weights in the new domain, it was changed to the new domain’s minimum value (if it was below the range) or maximum value (if it was above the range). In this step, the weights of four individuals in the CSD sample were adjusted.
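The treatment of domain jumpers amounts to clamping a weight to the range observed in the new domain. The sketch below assumes that the range is computed from the weights of the units already in that domain, excluding the jumper; the names and values are hypothetical.

```python
def clamp_to_domain_range(weight, new_domain_weights):
    """Keep the weight if it lies within the new domain's range of weights;
    otherwise set it to the domain minimum or maximum."""
    lowest = min(new_domain_weights)
    highest = max(new_domain_weights)
    return max(lowest, min(weight, highest))


# Hypothetical weights of the units already in the new domain.
domain_weights = [20.0, 35.0, 50.0]
print(clamp_to_domain_range(40.0, domain_weights))  # 40.0 (unchanged)
print(clamp_to_domain_range(60.0, domain_weights))  # 50.0 (set to maximum)
print(clamp_to_domain_range(10.0, domain_weights))  # 20.0 (set to minimum)
```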
Next, the sigma-gap method was used once again, this time using the final estimation domain, and comparing the gap between two successive weights (above the median) in relation to twice the standard deviation of the distribution of weights in the domain. At this point, the weights of 14 units were reduced. Weight reductions made at this step will be offset in the calibration step.
Calibration
The last step of the weighting process is a calibration to totals estimated from the census long-form questionnaire (excluding First Nations reserves and people under 15 years). The purpose of this step is to minimize the sampling variability of estimates derived from the CSD. The calibration was performed separately for the YES and NO samples.
Calibration for the YES sample was done by province on the following control totals: total number of persons by ten-year age group (15 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64, 65 to 74, and 75 and over), total number of persons by gender (Man+ and Woman+), and total number of persons by severity (mild, moderate and high). The term “severity” refers to the three levels of severity used to stratify the CSD based on responses to the six filter questions on Activities of daily living. The weights obtained from the previous step were modified as little as possible, so that the weighted estimates would be equal to the estimated census totals for these constraints. Statistics Canada's Generalized Estimation System (GEST) was used to carry out the calibration.
For the NO sample, a post-stratification was performed on the initial weights calculated earlier. The post-stratification was done by province, five-year age groups (15 to 19, 20 to 24, 25 to 29, 30 to 34, 35 to 39, 40 to 44, 45 to 49, 50 to 54, 55 to 59, 60 to 64, 65 to 69, 70 to 74, and 75 and over), and gender. As a reminder, the severity is null for the entire NO population.
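Post-stratification of the kind described for the NO sample can be sketched as follows: within each cell, weights are scaled so that they sum to the control total. The cell labels, control totals and record layout are hypothetical. The sketch does not represent the GEST calibration used for the YES sample, which adjusts to several sets of marginal totals at once.

```python
from collections import defaultdict


def post_stratify(units, control_totals):
    """Scale weights so that each post-stratum sums to its control total."""
    weighted_count = defaultdict(float)
    for unit in units:
        weighted_count[unit["cell"]] += unit["weight"]
    return [
        unit["weight"] * control_totals[unit["cell"]] / weighted_count[unit["cell"]]
        for unit in units
    ]


# Hypothetical cells defined by province, five-year age group and gender.
no_sample = [
    {"cell": ("ON", "15 to 19", "Woman+"), "weight": 10.0},
    {"cell": ("ON", "15 to 19", "Woman+"), "weight": 30.0},
    {"cell": ("ON", "20 to 24", "Man+"), "weight": 20.0},
]
census_totals = {
    ("ON", "15 to 19", "Woman+"): 50.0,
    ("ON", "20 to 24", "Man+"): 18.0,
}
post_weights = post_stratify(no_sample, census_totals)
print(post_weights)  # [12.5, 37.5, 18.0]
```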
Out-of-scope adjustment
During collection of the CSD, some out-of-scope cases were found among the selected respondents. They were initially considered to be respondent units in that we were able to speak with a household member who confirmed the unit’s out-of-scope status. Their weight was not set to 0; rather, it was retained because they represented units of the initial population (on May 11, 2021) that were out of scope in the summer or fall of 2022. However, these units are excluded from the analytical file.
The weight associated with these out-of-scope cases was used to estimate the number of people in the YES population who became out of scope between Census Day and CSD collection. CSD collection occurred between 13 and 18 months after the census, and just over 265,000 out-of-scope cases are estimated in the YES population, or 2.2% of this population. Estimates for the various types of out-of-scope cases in the YES population are presented in the table below.
| Type of out-of-scope | Unweighted (number) | Weighted (number) | Weighted (percent) |
|---|---|---|---|
| Deaths | 841 | 188,850 | 71.2 |
| Institutional admissions | 262 | 50,310 | 19.0 |
| Emigrants | 64 | 22,820 | 8.6 |
| Persons less than 15 years old | 14 | 3,100 | 1.2 |
| Other | 12 | 300 | 0.1 |
| Total | 1,193 | 265,380 | 100.0 |

Note: The sum of the values for each category may differ from the total due to rounding.
Source: Statistics Canada, Canadian Survey on Disability, 2022.
The three most common types of out-of-scope cases in the YES population are deaths (71%), institutional admissions (19%) and emigrants (9%). There were very few other types of out-of-scope cases.
Since out-of-scope cases are excluded from the analytical file, and therefore from the disability rates, it is important to also try to exclude them from the rate denominator (which includes both the YES and NO populations) to avoid underestimating disability rates. Since the NO sample was not sent to collection, out-of-scope cases cannot easily be identified. Furthermore, it cannot be assumed that the proportion of out-of-scope cases in the NO population is the same as it is in the YES population, particularly with regard to death and institutionalization. Therefore, an indirect method was used to estimate and exclude out-of-scope cases in the NO population. It is not possible to correct for all types of losses in the NO sample because there are often no reliable data to do so; however, corrections were made where possible.
The Demography Division was asked to provide the attrition rates of the population aged 15 and over from May 11, 2021, to midway through the CSD collection period (September 5, 2022). These rates are calculated for the entire population, including collective dwellings and institutions, but excluding First Nations reserves. They were then applied to the population covered by the 2021 Census long-form questionnaire (which excludes collective dwellings and institutions, as well as First Nations reserves) to estimate the total number of losses due to death and emigration for this population. Since these losses cover the entire population (i.e., both the YES and NO populations), the estimates for deaths and emigration derived from the YES population can be subtracted to obtain an estimate of the losses due to death and emigration for the NO population. The weight of the NO sample is then adjusted downward to reflect these losses. This adjustment is made by province, age group and gender.
This method slightly overestimates losses because the attrition rates are calculated for a population that includes people who were living in an institution at the time of the census. However, the fact that it is not possible to correct the NO population for losses due to institutionalization somewhat offsets the overestimation. It should be noted that some of the deaths that occurred in institutions may have involved people who had been living in a private household at the time of the census, then were institutionalized and eventually died. Consequently, part of the overcorrection for deaths offsets the lack of corrections made for institutionalizations.
In the YES population, the number of deaths and emigrants between May 11, 2021 and September 5, 2022, is estimated to be roughly 212,000, or 1.8% of the YES population in the census. Adjustments for deaths and emigrants in the NO population reduced it by approximately 223,000 people, or 1.2% of the NO population in the census.
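The indirect adjustment for the NO population can be sketched for a single province, age group and gender cell as follows. The attrition rate, counts and weights are hypothetical, and a uniform downward scaling of the weights within the cell is an assumption about how the adjustment is applied.

```python
def adjust_no_weights(no_weights, population, attrition_rate, yes_losses):
    """Scale NO-sample weights in one cell down for deaths and emigration.

    population     : YES + NO population of the cell on Census Day
    attrition_rate : death and emigration rate supplied for the cell
    yes_losses     : weighted deaths and emigrants observed in the YES sample
    """
    total_losses = population * attrition_rate  # YES and NO combined
    no_losses = total_losses - yes_losses       # remainder attributed to NO
    no_total = sum(no_weights)
    factor = (no_total - no_losses) / no_total  # downward adjustment factor
    return [w * factor for w in no_weights]


# Hypothetical cell: 1,000 people, 2% attrition, 12 losses seen in YES.
cell_weights = [100.0, 300.0, 400.0]
adjusted_cell = adjust_no_weights(cell_weights, 1000.0, 0.02, 12.0)
print([round(w, 2) for w in adjusted_cell])  # [99.0, 297.0, 396.0]
```

Here 20 losses are expected in the cell, 12 of which are accounted for by the YES sample, so the NO weights are reduced by the remaining 8 (from 800 to 792).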
Table 6.2 provides the population counts for the YES and NO populations before and after the exclusion of out-of-scope cases.
| Population | Weighted count before excluding out-of-scope cases (number) | Weighted count after excluding out-of-scope cases (number) |
|---|---|---|
| YES population | 11,854,800 | 11,589,400 |
| NO population | 18,208,900 | 17,985,600 |
| Total | 30,063,700 | 29,575,000 |

Source: Statistics Canada, Canadian Survey on Disability, 2022.
6.2 File structure and content
Four analytical data files were created for CSD data: two analytical files with data on persons with a disability (main analytical file and restricted file) and two analytical files with data on persons without a disability (main analytical file and restricted file). Depending on the type of analysis required, researchers will use either the file on persons with a disability only or both files together.
The main analytical file on persons with a disability contains data on those persons selected for the CSD who, according to the definition of disability used in the CSD, are considered to have a disability. This file is the most comprehensive of the four. It contains all CSD data and many variables linked from the census. Any analysis that deals exclusively with persons with a disability can be done with this file alone.
For the 2022 CSD, Statistics Canada identified the following sensitive variables:
- GENDER3 (3-category gender)
- SEX (Sex at birth)
- SOR_01 (Sexual orientation)
Based on Statistics Canada guidelines, the dissemination of these variables is deemed restricted, which necessitates the use of four analytical files (as opposed to two, as was the case with the 2017 CSD).
To access the restricted analytical files, users must provide a project request containing information on how they intend to use these variables.
The main analytical file on persons without a disability contains data on two groups of people: one from the CSD’s YES sample, and one from the NO sample. The two groups are as follows:
- Group (a), false positives from the YES sample: Persons interviewed for the CSD who, upon completion of the Disability Screening Questions (DSQ), were not identified as having a disability (false positives). These respondents either reported that they were “never” limited in their day-to-day activities because of their condition or they reported being limited only “rarely” with “no difficulty” or “some difficulty” in performing certain tasks. All of these respondents were deemed not to have a disability and therefore did not have to complete the rest of the questions in the CSD.
- Group (b), persons from the NO sample: Persons who reported no difficulties or long-term conditions on any of the 2021 Census Activities of Daily Living filter questions. This group was not sent to CSD collection for disability identification: as a result of their responses to the census filter questions, these persons were automatically deemed not to have a disability.
Hence, the main analytical file on persons without a disability has different content depending on the group of people involved. For the persons in group (a), the false positives, only the data from the CSD’s DSQ module is captured, since the interview was terminated immediately after that module. However, the census variables are also available for this group. For the persons in group (b), from the NO sample, only the census variables are captured, since no CSD collection was done for those units.
The main analytical file on persons without a disability should be used together with the main analytical file on persons with a disability for two types of analysis: 1) calculation of disability rates, since the denominator must include both persons with a disability and persons without a disability, and 2) comparison of the census characteristics of persons with a disability and persons without a disability.
To distinguish between the two groups in the main analytical files, a derived variable was created, CSDPOPFL, which takes a value of 1 for persons with a disability, 2 for group (a) persons without a disability (false positives), and 3 for group (b) persons without a disability (NO sample).
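A weighted disability rate computed from the two main analytical files combined can be sketched as follows. The record layout and values are hypothetical, and the name of the final person weight variable is not given in this guide, so the sketch uses plain tuples of (CSDPOPFL, weight).

```python
def disability_rate(records):
    """Weighted disability rate from (CSDPOPFL, final person weight) pairs.

    CSDPOPFL = 1: persons with a disability
    CSDPOPFL = 2: persons without a disability, group (a), false positives
    CSDPOPFL = 3: persons without a disability, group (b), NO sample
    """
    with_disability = sum(w for flag, w in records if flag == 1)
    everyone = sum(w for flag, w in records if flag in (1, 2, 3))
    return with_disability / everyone


# Hypothetical records stacked from both main analytical files.
combined = [(1, 100.0), (1, 150.0), (2, 50.0), (3, 700.0)]
print(disability_rate(combined))  # 0.25
```

The denominator includes all three groups, which is why the file on persons with a disability alone is not sufficient for rate calculations.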
The table below summarizes the contents of the four data files for each of the population groups. As shown, the four files will have different sets of variables. The main analytical file on persons WITH a disability will have a complete set of variables. For the main analytical file on persons WITHOUT a disability, some of the variables will be missing from the file. Missing variables will be slightly different for each of the two population groups on that file. As a result, when using the main analytical file on persons WITH a disability together with the main analytical file on persons WITHOUT a disability, some variables will show missing values for persons WITHOUT a disability.
| Analytical data files | Population group | CSDPOPFL | Demographic variables | DSQ | CSD thematic content | Census variables | Final person weight (Table 6.3 Note 1) | Sensitive variables (Table 6.3 Note 2) |
|---|---|---|---|---|---|---|---|---|
| Files on persons WITH a disability | Persons WITH a disability | 1 | √ | √ | √ | √ | √ | √ |
| Files on persons WITHOUT a disability | Persons WITHOUT a disability (group (a)) | 2 | √ | √ | . | √ | √ | √ |
| Files on persons WITHOUT a disability | Persons WITHOUT a disability (group (b)) | 3 | √ | . | . | √ | √ | √ |

√ Content is available. A period (.) indicates that content is not available for any reference period.
The sensitive variables are on the restricted analytical file; all other columns are on the main analytical file.
A note on reference periods
When calculating disability rates or comparing the characteristics of persons with a disability and persons without a disability, the reference date is May 11, 2021, Census Day. This is the date when the CSD sampling frame was defined and when comparative census indicators were collected for persons with and without disabilities. However, when researchers are only interested in persons with a disability, they will work with the CSD data collected and measured in the summer and fall of 2022 for the subset of persons with a disability. In this case, the reference period is from June 3 to November 30, 2022.
In other words, for the CSD, persons with a disability are individuals who reported having difficulty sometimes, often or always on the Activities of Daily Living question on the 2021 Census long-form, and who reported having a disability in the CSD in 2022. Hence, the CSD’s characteristics for persons with a disability are based on 2022 information about a population defined in 2021.
6.3 Final datasets and data dictionaries
Final data files include the following:
- Analytical files for use in Research Data Centres (RDCs) across Canada;
- Data files for use by subscribers of the Real Time Remote Access (RTRA) system at Statistics Canada.
The analytical files are dissemination files with enhanced protection of respondent confidentiality for release and distribution to RDCs across Canada. They are also used at Statistics Canada to produce data tables in response to client requests. Person-weights are available on the files (weighting is described in more detail in Section 6.1). Any variables used exclusively for processing purposes or for internal research were removed from the analytical files.
Accompanying the 2022 CSD analytical files are the following supporting documents:
- The record layout;
- A user guide entitled, Canadian Survey on Disability, 2022: A User Guide to the Analytical Data Files, as described in section 6.4;
- This CSD 2022: Concepts and Methods Guide, as an essential companion document to the user guide.
RTRA data files are housed at Statistics Canada for use by subscribers who can run statistical programs on the data from remote locations. These files consist of the analytical data files but have undergone further processing. All sub-provincial geographies have been removed, permitting analysis only at the national, provincial and territorial levels.
For RTRA users, data dictionaries are provided with full descriptions for all the variables but without any data frequencies, called the “zero-frequency” versions.
6.4 Guidelines for analysis
The User Guide created for the RDC analytical files provides detailed step-by-step instructions for using the 2022 CSD data files. It includes guidelines for tabulation and statistical analysis, instructions on how to apply the necessary weights to the data, information on the software packages available, and guidelines for the release of data, such as rules on confidentiality, reliability, rounding and minimum sample sizes. The process of calculating the reliability of estimates, both quantitative and qualitative, is covered in detail.
For RTRA users, confidentiality rules and reliability standards are applied to all tabulation requests in an automated way by the RTRA system.
The CSD User Guide is for use in combination with the Concepts and Methods Guide and the data dictionaries.