Modelling risk factor information for linked census data: The case of smoking

by Claudia Sanmartin, Philippe Finès, Saeeda Khan, Paul Peters, Michael Tjepkema, Julie Bernier and Rick Burnett

Administrative data are increasingly used to monitor the health of the population and to better understand health service use and outcomes.  Advantages of using administrative data for health research include large population-based cohorts, low collection costs, and reduced bias from loss to follow-up.1-3 Despite these advantages, administrative data have limited individual-level information, frequently restricted to demographics such as age and sex, and often do not include socio-economic or risk factor information, which limits a broader understanding of health outcomes.

To overcome these deficiencies, ecological approaches have “appended” area-level measures, such as neighbourhood indicators of socio-economic status, to administrative data.4-6 However, ecological methods are prone to potential misclassification, underestimation of effect sizes, and inability to adjust for competing factors.7-9 Moreover, the results of area-based studies reflect the characteristics not only of the population, but also of the physical and social setting of the particular geographic regions.10

Statistical techniques have also been employed to indirectly adjust for missing data that are associated with health outcomes. For instance, partitioned regression uses information from ancillary sources to adjust for missing risk factors.11 This approach depends on the availability of such information from external data sources or in the literature.

Increasingly, data linkage is being used to fill information gaps in administrative data. For example, individual-level information collected in health surveys has been linked to hospital records to study broad determinants of hospital utilization.12-14  These linked data are rich in individual-level information, but sample size and coverage issues often restrict analyses of subgroups and less common outcomes.

To address this shortcoming, Statistics Canada initiated a series of projects to link information from the Census of Population long form with health outcome information, namely, mortality, hospitalization and cancer.15,16 These linked datasets offer extensive, individual-level socio-economic information and large sample sizes, but they lack information on risk factors such as smoking and obesity.

This study assesses the feasibility of using statistical modelling techniques to fill information gaps related to risk factors, specifically, smoking, in linked census data.15 Based on the Canadian Community Health Survey (CCHS), predictive algorithms were developed to model smoking status using variables common to the CCHS and the 1991 long-form Census. The resultant smoking variable was validated by comparing the performance of modelled versus self-reported smoking status in predicting smoking-related hospitalizations based on linked health survey and hospital data. This was considered an important step, since understanding how the modelled information performs in analysis is critical to assessing the utility of this approach.


Data source

Data from the CCHS were used to develop and validate predictive models for smoking status. The CCHS is a cross-sectional survey providing information about the health, lifestyle and health care use of the non-institutionalized household population aged 12 or older in the provinces and territories. The survey excludes full-time members of the Canadian Forces and residents of Indian reserves and some remote areas. A detailed description of the CCHS is available elsewhere.17

The 2000/2001 CCHS, the cycle closest in time to the linked 1991 Census cohort, was used to construct the predictive models. The response rate was 85%, for a total sample of 131,535. The sample for the present study was restricted to respondents aged 25 or older, the age criterion applied to the linked 1991 Census cohort. Records with missing information on smoking status were excluded, resulting in a final CCHS sample of 104,204.

Data from the 2002/2003 CCHS were used to externally validate the predictive models. The response rate was 81%, for a total sample of 134,072. Similar exclusions resulted in a final validation sample of 107,398.

The 2002/2003 CCHS data linked to the Hospital Morbidity Database (HMDB) (2001/2002 to 2004/2005) were used to evaluate associations between the modelled versus the self-reported smoking variable and smoking-related hospitalizations.  The HMDB is a person-level administrative dataset representing inpatient hospitalizations from most acute care hospitals and some psychiatric, chronic and rehabilitation hospitals in Canada.18 Data linkage was conducted among CCHS respondents living outside Quebec who agreed to link and who provided a valid personal health number (n=81,364). Similar exclusions were applied to the linked data (age 25 or older; missing smoking data), yielding a final sample of 52,396. Details about the data linkage are provided elsewhere.12,19

Development of predictive models

Separate models were constructed to predict two smoking categories: current daily smokers and never smokers.  Smoking categories were derived based on self-reported information in the CCHS.20 Current daily smokers were defined as respondents who reported that they smoked on a daily basis (1=yes, 0=no). Never smokers were those who reported that they had never smoked or had smoked fewer than 100 cigarettes in their lifetime (1=yes, 0=no). Attempts to predict former smokers were unsuccessful, as models were unable to discriminate between current, never and former smokers.

To be used to predict smoking status, CCHS variables had to be available in the census (long form) and to have been shown to be or hypothesized to be associated with smoking. When possible, the CCHS variables were coded to match the census variable definitions. Economic, socio-demographic, housing and ethno-cultural variables were used to predict smoking status (Table 1).

Multivariate logistic regression models were constructed to predict the probability of being a current daily smoker and the probability of being a never smoker. Age-/sex-specific models were developed because preliminary analyses revealed variability in the factors associated with smoking status across age and sex groups. The full study sample was used in both sets of models so that each CCHS respondent had a probability estimate of being a current daily smoker and a probability estimate of being a never smoker. A stepwise technique was applied to select a parsimonious list of variables for each age and sex group; variables were included in the model by decreasing strength of significance. Survey weights were used, and the bootstrap technique was applied to the final multivariate regression models to adjust for the complex design of the CCHS. The models were developed using PROC LOGISTIC in SAS version 9.1.
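As an illustration of how a fitted logistic model yields a probability for each respondent, the following minimal Python sketch evaluates the logistic function for one hypothetical age/sex group. The variable names and coefficient values are assumptions made for illustration only; they are not taken from the study, which fitted its models in SAS.

```python
import math

def predicted_probability(intercept, coefficients, covariates):
    """Return the logistic-model probability for one respondent.

    coefficients and covariates are dicts keyed by variable name;
    a covariate that is absent is treated as 0.
    """
    linear_predictor = intercept + sum(
        coefficients[name] * covariates.get(name, 0.0) for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical coefficients for one age/sex group (NOT estimates from the study).
daily_smoker_model = {
    "lowest_income_quintile": 0.6,
    "postsecondary": -0.5,
    "owns_dwelling": -0.4,
}
respondent = {"lowest_income_quintile": 1, "postsecondary": 0, "owns_dwelling": 0}

p_daily = predicted_probability(-1.2, daily_smoker_model, respondent)
print(round(p_daily, 3))
```

A second model of the same form, with its own coefficients, would supply the probability of being a never smoker for the same respondent.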

Model-specific thresholds were established to classify respondents into smoking categories. Specifically, Receiver Operating Characteristic (ROC) curves21-24 were generated to determine age-/sex-specific optimal probability thresholds. If the estimated probability of being a current daily smoker or a never smoker exceeded the corresponding optimal threshold, the individual was identified as a positive case. Optimal thresholds were generated to balance false positives and false negatives, with the aim of reducing the former. Given the large sample sizes associated with the census, focussing on true positives provides a more accurate model, even if a large number of false negatives are generated.
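The exact rule used to pick the optimal threshold is not stated here. As one possible illustration, the sketch below scans the candidate thresholds along an ROC curve and selects the one that maximizes Youden's J (sensitivity + specificity - 1). The choice of Youden's J is an assumption, not necessarily the criterion used in the study, and the probabilities and labels are toy values.

```python
def roc_points(probabilities, labels):
    """Return (threshold, sensitivity, specificity) for each distinct probability.

    A respondent is a predicted positive when the probability exceeds the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(probabilities)):
        true_pos = sum(1 for p, y in zip(probabilities, labels) if p > threshold and y == 1)
        false_pos = sum(1 for p, y in zip(probabilities, labels) if p > threshold and y == 0)
        points.append((threshold, true_pos / positives, 1 - false_pos / negatives))
    return points

def youden_threshold(probabilities, labels):
    """Pick the threshold with the largest sensitivity + specificity - 1 (an assumed rule)."""
    best = max(roc_points(probabilities, labels), key=lambda pt: pt[1] + pt[2] - 1)
    return best[0]

# Toy data: modelled probabilities and self-reported status (1 = smoker, 0 = not).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
print(youden_threshold(probs, labels))
```

A rule that weights false positives more heavily than false negatives, as the text describes, would shift the selected threshold upward relative to this symmetric criterion.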

Model performance was assessed based on the area under the ROC curve (AUC); the ROC curve is a plot of sensitivity versus 1 minus specificity. In addition, the percentage of cases accurately predicted was calculated by comparing smoking status based on self-reported and modelled information.
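Both validation measures can be sketched in a few lines of Python. The AUC is computed here through its rank interpretation (the share of positive/negative pairs in which the positive case has the higher modelled probability, with ties counted as one half), which equals the area under the ROC curve. The data are toy values and the analysis is unweighted, which is a simplification relative to the survey-weighted analysis in the study.

```python
def auc(probabilities, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for p, y in zip(probabilities, labels) if y == 1]
    neg = [p for p, y in zip(probabilities, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def percent_correct(probabilities, labels, threshold):
    """Percentage of respondents whose modelled status matches the self-report."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    matches = sum(1 for a, b in zip(predicted, labels) if a == b)
    return 100.0 * matches / len(labels)

# Toy data: modelled probabilities and self-reported status (1 = smoker, 0 = not).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

print(auc(probs, labels))                    # 0.875
print(percent_correct(probs, labels, 0.3))   # 87.5
```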

Assignment of smoking status

The predicted probabilities for current daily smoker and for never smoker were used to classify each individual as a current daily smoker (yes or no) and as a never smoker (yes or no), based on the age-/sex-specific thresholds. The two classifications were then combined to make a final assignment:

Smoking status

                    Current daily smoker
Never smoker        Yes                     No
Yes                 Unclassifiable          Never smoker
No                  Current daily smoker    Other

For example, respondents whose probability of being never smokers exceeded the age-/sex-specific threshold, and whose probability of being current daily smokers was below the age-/sex-specific threshold were classified as never smokers. Respondents identified as being both current daily smokers and never smokers were deemed unclassifiable and removed from further analysis. Respondents classified as other were determined to be neither current daily smokers nor never smokers; they could be occasional smokers or former smokers, or might represent false negatives.
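The assignment rule described above can be written as a small function. This is a sketch; the function name and the example threshold values are hypothetical, while the four outcome labels follow the text.

```python
def assign_smoking_status(p_daily, p_never, daily_threshold, never_threshold):
    """Combine the two yes/no classifications into one final smoking status."""
    is_daily = p_daily > daily_threshold   # classified as current daily smoker
    is_never = p_never > never_threshold   # classified as never smoker
    if is_daily and is_never:
        return "unclassifiable"            # removed from further analysis
    if is_never:
        return "never smoker"
    if is_daily:
        return "current daily smoker"
    return "other"                         # e.g. occasional or former smokers

# Hypothetical thresholds for one age/sex group.
print(assign_smoking_status(0.20, 0.55, daily_threshold=0.30, never_threshold=0.45))
```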

Additional threshold bands, defined as the optimal threshold +/- 0.05 or +/- 0.10, were generated to conduct sensitivity analyses. Respondents whose predicted probability was greater than the upper threshold were identified as positive cases, and those whose predicted probability was less than the lower threshold were identified as negative cases, with respect to current daily smoker and never smoker status. This was deemed appropriate because the end product of the analysis was not the predicted value of the outcome, but rather, the appropriate assignment of smoking status.
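A minimal sketch of the band-based rule follows. The text does not say how respondents whose probability falls inside the band were handled; returning None for them (leaving them unassigned) is an assumption of this sketch, and the threshold values are hypothetical.

```python
def classify_with_band(probability, optimal_threshold, half_width):
    """Classify against a band of optimal_threshold +/- half_width.

    Returns True (positive case), False (negative case), or None when the
    probability lies inside the band. Treating the in-band case as unassigned
    is an assumption of this sketch, not a rule stated in the study.
    """
    upper = optimal_threshold + half_width
    lower = optimal_threshold - half_width
    if probability > upper:
        return True
    if probability < lower:
        return False
    return None

# Hypothetical optimal threshold of 0.40 with a +/- 0.05 band.
for p in (0.50, 0.42, 0.30):
    print(p, classify_with_band(p, 0.40, 0.05))
```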

Application of modelled smoking status

Linked 2002/2003 CCHS and 2001/2002 to 2004/2005 hospital data were used to determine how the modelled smoking variable performed in analyses of health outcomes. The objectives were: 1)  to compare the association between smoking status and smoking-related hospitalizations using modelled versus self-reported smoking status; and 2) to assess the effect of using modelled smoking status on covariates also used to predict smoking status (for example, income, education). It was hypothesized that the effect size of the covariates may be reduced when using modelled smoking status, since similar variables were also used to predict smoking status.

A two-year follow-up period from the time individuals responded to the CCHS was examined to identify those who had at least one smoking-related hospitalization, defined as respiratory disease, cardiovascular or cancer-related admissions (based on ICD-9/10 and ICD-10-CA codes) reported as the primary diagnosis.25 Logistic regression analyses were conducted to compare the results of using modelled versus self-reported smoking status: never smoker (reference group), current daily smoker, and other. A model-building approach was used to generate unadjusted models (Model 1: smoking status only), partially adjusted models (Model 2: smoking status + age and sex), and fully adjusted models (Model 3: Model 2 + additional socio-economic variables).
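For the unadjusted model (Model 1), the odds ratio from a logistic regression with smoking status as the only predictor equals the ratio of the odds in the two groups being compared. The sketch below computes that ratio from counts. The counts are hypothetical, and the calculation is unweighted, whereas the study applied survey weights.

```python
def odds_ratio(exposed_events, exposed_total, reference_events, reference_total):
    """Odds of the outcome in the exposed group relative to the reference group."""
    exposed_odds = exposed_events / (exposed_total - exposed_events)
    reference_odds = reference_events / (reference_total - reference_events)
    return exposed_odds / reference_odds

# Hypothetical counts of respondents with at least one smoking-related
# hospitalization during follow-up (NOT figures from the study).
or_daily_vs_never = odds_ratio(
    exposed_events=60, exposed_total=1000,      # current daily smokers
    reference_events=30, reference_total=1000,  # never smokers (reference group)
)
print(round(or_daily_vs_never, 2))
```

Models 2 and 3 add covariates, so their odds ratios cannot be read from a single two-by-two table and require fitting the regression.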

Survey weights for the linked CCHS file were adjusted by Statistics Canada to control for non-response to the survey and for the exclusion of records of respondents who did not agree to link and/or did not provide plausible health numbers. The bootstrap technique was applied to all analyses to account for the complex survey design in the estimate of variance and confidence intervals.
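With replicate bootstrap weights, a common way to estimate variance is to recompute the statistic under each set of replicate weights and average the squared deviations from the full-sample estimate. The sketch below applies this to a weighted proportion. The estimator form, the tiny number of replicates and all weight values are assumptions for illustration; the study's own bootstrap procedure is not detailed in the text.

```python
import math

def weighted_proportion(values, weights):
    """Weighted share of respondents with values == 1."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def bootstrap_standard_error(values, full_weights, replicate_weights):
    """Return (estimate, standard error) using replicate bootstrap weights."""
    estimate = weighted_proportion(values, full_weights)
    replicates = [weighted_proportion(values, w) for w in replicate_weights]
    # Mean squared deviation of replicate estimates from the full-sample estimate.
    variance = sum((r - estimate) ** 2 for r in replicates) / len(replicates)
    return estimate, math.sqrt(variance)

# Toy data: 1 = hospitalized, 0 = not; two hypothetical replicate weight sets.
values = [1, 0, 1, 0]
full_weights = [1, 1, 1, 1]
replicate_weights = [[2, 0, 1, 1], [0, 2, 1, 1]]

estimate, se = bootstrap_standard_error(values, full_weights, replicate_weights)
print(estimate, se)
```

In practice many more replicates would be used, and the same loop would wrap a regression fit rather than a simple proportion.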


Study population

Based on responses to the 2000/2001 CCHS, approximately 41% of the household population aged 25 or older were never smokers, and 26% were current daily smokers (Table 1). The majority of people were married or in a common-law relationship (71%), were employed (64%), owned their dwelling (73%), lived with at least one other person (85%), and had been born in Canada (76%). Around 40% had at least some postsecondary education. Just under half (46%) lived in urban areas with more than 500,000 inhabitants.

Predictive models

The variables that were important in predicting smoking status differed by age group and sex and are presented in order of significance (Table 2). For models predicting current daily smoker, income quintile, education, marital status, dwelling ownership and world region of birth were significant predictors across all age and sex groups. For the never smoker models, marital status, dwelling ownership, Aboriginal ancestry and world region of birth were significant predictors across all age and sex groups. When the age-/sex-specific optimal thresholds were applied to the probabilities generated from the predictive models, close to 80% of respondents were assigned to either the current daily smoker or never smoker categories, 7.0% were unclassifiable, and 14.6% were classified as other.

AUC values ranged from 0.59 to 0.73 for the current daily smoker models, and from 0.60 to 0.70 for the never smoker models, and decreased with advancing age. Using optimal thresholds, the percentage of cases correctly predicted based on modelled values ranged from 54% to 67% for current daily smoker, and from 57% to 65% for never smoker. The percentage of correctly predicted cases decreased when the wider threshold bands (optimal +/- 0.05 and optimal +/- 0.10) were used.

Modelled versus self-reported smoking status

Logistic models were developed to compare the performance of modelled versus self-reported smoking status in predicting smoking-related hospitalizations, and to assess the effect of using the modelled variable on covariates that had also been used to predict smoking status (for example, income, education).

As expected, based on self-reported smoking status, being a current daily smoker rather than a never smoker was associated with increased odds of at least one smoking-related hospitalization in both unadjusted and adjusted models (Table 3). The association was similar, but weaker, when modelled smoking status was used. Unadjusted odds ratios for modelled current daily smoker status ranged from 1.81 to 2.99 across various threshold definitions. The odds ratios remained significant in the fully adjusted models using the optimal threshold (OR: 1.30) and the optimal threshold +/- 0.05 (OR: 1.52), but were lower than the odds ratio obtained when self-reported smoking status was used (OR: 2.19).

Overall, variables significantly associated with smoking-related hospitalizations in the model using self-reported smoking status (Model A) remained significant when modelled smoking status was used instead (Table 4). Older age, Aboriginal identity, widowhood, lower education and being unemployed or not in the labour force were consistently associated with higher odds of a smoking-related hospitalization. Being female and being never married were associated with lower odds of a smoking-related hospitalization. Income was not associated with smoking-related hospitalizations, regardless of whether the model incorporated self-reported or modelled smoking status.


This study examined the feasibility of using statistical modelling techniques to predict smoking status, and then assessed the association between the modelled variable and smoking-related hospitalizations. The set of socio-economic and demographic characteristics that were predictive of smoking status varied by age and sex, which highlights the importance of developing age-/sex-specific models.

Model validation revealed AUC values close to 0.70 for most of the age/sex models, somewhat below values achieved in other studies.26 However, this project is unique in that no health-related variables were used to predict smoking status, whereas in other studies, health-related characteristics are commonly used to predict outcomes such as hospitalization and mortality. AUC values were consistently low in the models for females aged 65 or older, for both current daily smokers and never smokers. The ability of the predictive models to accurately assign smoking status decreased when threshold levels were relaxed.

This study was motivated by the need to provide risk factor information in census data that are linked to administrative records to study characteristics associated with health outcomes. Hence, it was critical to demonstrate the feasibility of using modelled smoking status in a research context. The linked survey and hospital data offered this opportunity.

The results of the regression analysis that compared associations between modelled versus self-reported smoking status and smoking-related hospitalizations demonstrated the viability of the modelled variable. Modelled smoking status behaved like self-reported smoking status in terms of direction of association and significance, albeit with smaller effect sizes. Furthermore, the use of modelled smoking status did not eliminate associations between hospitalization and other covariates (for example, marital status, education, employment status). The association between modelled smoking status and hospitalization was reduced in the fully adjusted models, but remained significant.


This study has several limitations. The CCHS excludes specific subgroups (Canadian Forces, residents of Indian reserves and some remote areas) and people who did not agree to link their data; these exclusions may have affected the final models used to predict smoking status. The feasibility of using modelled smoking status was assessed only in the context of smoking-related hospitalizations using logistic regression analysis. Further investigation is needed to determine if this modelled variable can be used in studies employing alternative techniques (for example, survival analysis) and/or outcomes (for example, mortality).


Data linkage is a cost-effective method of obtaining person-level data to study health outcomes at the population level. However, data gaps, specifically, a lack of risk factor information, may exist. This study demonstrates the feasibility of using statistical modelling techniques to supplement linked data sources with missing risk factor information.