Model-assisted calibration of non-probability sample survey data using adaptive LASSO
Section 5. Application to National Health Interview Survey (NHIS)
5.1 NHIS and ACS data
We next apply LASSO calibration to the 2013 National Health Interview Survey
(NHIS) to estimate the total number of adults (age 18 or older) diagnosed with
cancer in the population. The NHIS is a nationally representative sample of
non-institutionalized civilian households collected through multi-stage
area-probability sampling (Centers for Disease Control and Prevention, 2005).
Each month, health-related data on a cross-sectional sample of people in
selected households are obtained by face-to-face interviews. The data provide
pseudo-primary-sampling-units (PSUs), pseudo-strata, and sampling weights to
allow for weighted estimates that account for the complex survey design. In
addition to health-related measures, NHIS also collects demographic data. Our
goal is to assess our LASSO estimator by treating the unweighted NHIS sample as
if it were a non-probability sample, and to explore how GREG and LASSO
calibration compare with the design-weighted estimator.
To calibrate NHIS on a set of demographic and income-related variables, we use
the American Community Survey (ACS) 2013 micro-data as the benchmark data. ACS
samples are households selected through multi-stage area-probability sampling
from 3,143 counties of the U.S. The ACS is designed to improve small-area
estimates between the decennial census long-form samples. Around three million
households are selected each year, with measures collected on household types
and individual demographics within the households. ACS also collects data from
group quarters, which are excluded from this analysis. For ACS 2013, the sample
size for adults is 2,317,301. The NHIS 2013 sample size is 34,201 after
removing 242 cases with missing values on demographic variables. For the
purposes of this analysis, we treat the weighted estimates from the ACS as
known population totals, a reasonable assumption given that the ACS adult
sample is roughly 68 times larger than the NHIS sample.
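As a minimal sketch of how benchmark totals of this kind can be derived from
weighted micro-data, the following Python snippet sums person-level weights
within the categories of a variable. The records, field names, and weights are
invented for illustration; they are not ACS values, and the helper functions
are hypothetical.

```python
from collections import defaultdict


def weighted_totals(records, variable, weight="weight"):
    """Sum the survey weights within each category of `variable`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[variable]] += rec[weight]
    return dict(totals)


def weighted_shares(totals):
    """Convert category totals into proportions of the overall total."""
    grand_total = sum(totals.values())
    return {cat: t / grand_total for cat, t in totals.items()}


# Hypothetical benchmark micro-data: each record carries a person-level weight.
benchmark = [
    {"region": "Northeast", "weight": 120.0},
    {"region": "South", "weight": 200.0},
    {"region": "South", "weight": 80.0},
    {"region": "West", "weight": 100.0},
]

region_totals = weighted_totals(benchmark, "region")   # calibration targets
region_shares = weighted_shares(region_totals)         # weighted distribution
```

The weighted totals play the role of the calibration targets, while the shares
correspond to a weighted distribution of the kind reported for the benchmark
data.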
5.2 Estimators
The outcome variable of interest is whether a person has been diagnosed with
cancer. Define the binary indicator for the outcome variable:
$$y_i = \begin{cases} 1 & \text{if person } i \text{ has been diagnosed with cancer,} \\ 0 & \text{otherwise.} \end{cases}$$
We first use the NHIS 2013 sampling weights, $w_i^{NHIS}$, and design
variables to obtain an unbiased estimate of the population total,
$\hat{T}_y^{NHIS}$. Then we assume that the NHIS 2013 sample was collected by
simple random sampling, with initial design weights $d_i = N/n$, where $N$ is
the population total obtained from ACS, and $n$ is the sample size of NHIS. We
calibrate $d_i$ by a set of demographic and income variables $\mathbf{x}_i$
with traditional GREG calibration and LASSO calibration. Finally, as a
compromise between GREG and LASSO, we consider model-assisted calibration to a
linear model for $y_i$ instead of the LASSO using (2.7); note that, when the
predicted value $\hat{y}_i$ is computed using the same linear model as in
GREG, the point estimates of the total will correspond, even though the
calibration weights will differ. Thus, we generate seven estimates:
- NHIS: Estimate obtained with the NHIS sampling weights.
- HTSRS: Estimate obtained with the equal simple-random-sampling (SRS) design
weights.
- GREG1: Estimate obtained by calibrating the SRS design weights with GREG
using all calibration variables.
- GREG1 MC: Estimate obtained by model-assisted calibration to a linear model
using the predictors in GREG1.
- GREG2: Estimate obtained by calibrating the SRS design weights with GREG
using only the calibration variables chosen by backward stepwise variable
selection.
- GREG2 MC: Estimate obtained by model-assisted calibration to a linear model
using the predictors in GREG2.
- LASSO: Estimate obtained by model-assisted calibration with LASSO.
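The relationship between GREG and model-assisted calibration to the same
linear model can be illustrated with a small numerical sketch. The Python code
below is not the authors' implementation: the data are invented, all function
names are hypothetical, linear (chi-square distance) calibration is assumed,
and the model-calibration constraints are taken to be the population size and
the population total of the fitted values, which may differ in detail from
(2.7). The working model includes an intercept and is fitted with the same
equal SRS weights.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def cross_products(X, d):
    """Return the matrix sum_i d_i x_i x_i'."""
    p = len(X[0])
    return [[sum(di * xi[j] * xi[k] for xi, di in zip(X, d))
             for k in range(p)] for j in range(p)]


def linear_calibration(X, d, totals):
    """Linear calibration weights w_i = d_i (1 + x_i' lam) that hit `totals`."""
    p = len(totals)
    gap = [totals[j] - sum(di * xi[j] for xi, di in zip(X, d)) for j in range(p)]
    lam = solve(cross_products(X, d), gap)
    return [di * (1.0 + sum(xi[j] * lam[j] for j in range(p)))
            for xi, di in zip(X, d)]


def wls(X, y, d):
    """Weighted least-squares coefficients of y on X with weights d."""
    p = len(X[0])
    rhs = [sum(di * xi[j] * yi for xi, yi, di in zip(X, y, d)) for j in range(p)]
    return solve(cross_products(X, d), rhs)


# Toy sample (invented): intercept, one continuous and one binary covariate.
X = [[1, 1, 1], [1, 2, 0], [1, 3, 1], [1, 4, 0], [1, 5, 0], [1, 6, 1]]
y = [0, 0, 1, 0, 1, 1]
N, n = 60, len(y)
d = [N / n] * n                      # equal SRS weights N / n
x_totals = [60.0, 240.0, 24.0]       # assumed known population totals of X

# GREG: calibrate d directly on the covariate totals.
w_greg = linear_calibration(X, d, x_totals)
total_greg = sum(w * yi for w, yi in zip(w_greg, y))

# Model calibration: calibrate on N and the population total of fitted values.
beta = wls(X, y, d)
fitted = [sum(b * xj for b, xj in zip(beta, xi)) for xi in X]
fitted_total = sum(b * t for b, t in zip(beta, x_totals))
Z = [[1.0, f] for f in fitted]
w_mc = linear_calibration(Z, d, [float(N), fitted_total])
total_mc = sum(w * yi for w, yi in zip(w_mc, y))
```

In this toy setting `total_greg` and `total_mc` coincide up to floating-point
error, while `w_greg` and `w_mc` are different sets of weights, because the
model-calibration weights satisfy two constraints rather than one per
covariate.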
The variance of the NHIS estimator is the linearization variance estimate of
the total, accounting for sampling strata, primary sampling units, and survey
weights in the NHIS 2013 sample. Variances of HTSRS, GREG1, and GREG2 are
linearization variance estimates computed with the SRS design weights, the
GREG1 calibration weights, and the GREG2 calibration weights, respectively. We
obtain the variance of the LASSO estimator through a naive bootstrap.
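A naive bootstrap of this kind can be sketched as follows. This is a generic
illustration rather than the procedure used in the paper: the sample is
invented, the function names and the number of replicates are assumptions, and
a simple expansion estimator stands in for the full LASSO calibration, which
in a real application would presumably be re-run on every replicate.

```python
import random
import statistics


def bootstrap_se(sample, estimator, n_reps=500, seed=2013):
    """Naive bootstrap: resample units with replacement, re-estimate, take the SD."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_reps):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)


def expansion_total(units, population_size=1000):
    """Stand-in estimator: (N / n) times the sample sum of y."""
    return population_size / len(units) * sum(units)


# Hypothetical binary outcome: 10 of 100 sampled units have the condition.
y = [1] * 10 + [0] * 90
se = bootstrap_se(y, expansion_total)
```

The resampling here ignores strata and clusters, which is what makes the
bootstrap "naive"; it treats the sample as if it were a simple random sample.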
5.3 Working models
Table 5.1 lists calibration variable names, labels, values, and distributions
in this analysis. The first column is the unweighted distribution of variables
in the NHIS sample. The second column contains variable distributions in the
NHIS sample, weighted by person-level weights. The third column is the
distribution of variables in the population obtained from the ACS benchmark
data. Missing income is included as a separate category to capture the
difference in missing-data patterns between NHIS and ACS. Including a missing
category also allows us to maintain the analytic sample size. Relative to ACS,
the unweighted NHIS sample has higher proportions of females and of
widowed/divorced/separated persons, and a lower proportion of non-Hispanic
whites. After weighting, the NHIS distributions of gender and race are close to
the benchmark's, and only marital status categories show some differences.
We use an unweighted linear model with backward stepwise variable selection to
determine the working model for GREG2. The final variables included in the
model for GREG2 are age, education, race, employment status (yes/no), and
family income. For standard GREG and LASSO calibration, we use all available
variables.
Table 5.1
Calibration variables
| Variable | Category | NHIS, no weights | NHIS, person-level weights | ACS, person-level weights |
| --- | --- | --- | --- | --- |
| Region | Northeast | 16% | 18% | 18% |
| | Midwest | 20% | 23% | 21% |
| | South | 37% | 37% | 37% |
| | West | 26% | 23% | 23% |
| Age | 18-29 | 19% | 21% | 21% |
| | 30-39 | 17% | 17% | 17% |
| | 40-49 | 16% | 18% | 18% |
| | 50-59 | 17% | 18% | 18% |
| | 60-69 | 15% | 14% | 14% |
| | 70-79 | 9% | 8% | 8% |
| | 80+ | 6% | 4% | 5% |
| Gender | Male | 45% | 48% | 48% |
| | Female | 55% | 52% | 52% |
| Education | Less than high school | 16% | 14% | 13% |
| | High school or less | 26% | 26% | 28% |
| | Some college | 20% | 20% | 23% |
| | College graduate | 29% | 30% | 25% |
| | Post-graduate | 10% | 10% | 10% |
| Race/Ethnicity | Non-Hispanic white | 60% | 66% | 66% |
| | Non-Hispanic black | 15% | 12% | 12% |
| | Hispanic | 17% | 15% | 15% |
| | Other | 8% | 7% | 7% |
| Marital Status | Married/partnered | 49% | 60% | 52% |
| | Widowed/divorced/separated | 27% | 18% | 20% |
| | Never married | 24% | 22% | 28% |
| Employed | Yes | 35% | 33% | 39% |
| | No | 65% | 67% | 61% |
| Income | 1st quartile | 22% | 15% | 19% |
| | 2nd quartile | 20% | 17% | 20% |
| | 3rd quartile | 21% | 22% | 20% |
| | 4th quartile | 21% | 28% | 19% |
| | missing | 17% | 19% | 22% |
5.4 Results
Table 5.2 lists the estimates, standard errors (SE), root mean square error
treating the correctly weighted NHIS estimate as the true value (RMSE), the
percent deviation from the NHIS estimate,
$$\%\ \text{deviate} = 100 \times \frac{\hat{T}_y - \hat{T}_y^{NHIS}}{\hat{T}_y^{NHIS}},$$
where $\hat{T}_y$ is the estimate from a given estimator and
$\hat{T}_y^{NHIS}$ is the NHIS estimate, and the standard deviation, minimum,
and maximum of the weights associated with a given estimator. We treat the NHIS
estimate as the unbiased estimate because it is calculated with the
probability-based sampling weights provided by NHIS. Without any weighting
adjustment, HTSRS shows a positive bias of 5.9%. The GREG2 estimator reduces
this bias from 5.9% to 2.0%, the GREG1 estimator reduces it to 1.8%, and the
LASSO estimator reduces it to 0.9%. By definition, the model-assisted estimator
using linear predictors yields the same point estimate as the corresponding
GREG estimator; however, the variability is substantially reduced. In this
analysis, if NHIS were a non-probability sample, without weighting adjustment
we would have over-counted the number of adults with cancer by 1.18 million.
With traditional calibration, the error is reduced to an over-count of 365
thousand (without variable selection) or 392 thousand (with variable
selection). LASSO calibration further reduces the over-count to 175 thousand.
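The summary measures can be checked directly from the tabulated values. The
short Python sketch below recomputes the deviation, percent deviation, and RMSE
for four of the estimators, using point estimates and standard errors copied
from Table 5.2. It assumes that the RMSE combines the standard error and the
deviation from the NHIS estimate as $\sqrt{SE^2 + \text{deviation}^2}$, which
is consistent with the tabulated figures; the function and variable names are
illustrative.

```python
import math

NHIS_ESTIMATE = 19_889_327  # design-weighted estimate from Table 5.2

# (point estimate, standard error) pairs copied from Table 5.2
table_5_2 = {
    "HTSRS": (21_070_498, 362_883),
    "GREG1": (20_254_449, 375_064),
    "GREG2": (20_281_603, 367_900),
    "LASSO": (20_064_671, 347_586),
}


def summarize(estimate, se, reference=NHIS_ESTIMATE):
    """Deviation, percent deviation, and RMSE relative to the reference."""
    deviation = estimate - reference
    return {
        "deviation": deviation,
        "pct_deviate": 100.0 * deviation / reference,
        "rmse": math.sqrt(se ** 2 + deviation ** 2),
    }


summary = {name: summarize(est, se) for name, (est, se) in table_5_2.items()}
```

For example, `summary["HTSRS"]` gives a deviation of 1,181,171 (the 1.18
million over-count), a percent deviation of 5.94, and an RMSE that rounds to
1,235,657, while `summary["LASSO"]["deviation"]` is 175,344.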
Table 5.2
Results for estimating total number of individuals with cancer. % deviate is the difference to NHIS estimate divided by the NHIS estimate
| Estimator | Estimate | SE | RMSE | % deviate from NHIS | SD (min, max) of weights |
| --- | --- | --- | --- | --- | --- |
| NHIS | 19,889,327 | 492,263 | 492,263 | 0.00% | 5,913 (168; 93,244) |
| HTSRS | 21,070,498 | 362,883 | 1,235,657 | 5.94% | 0 (6,866; 6,866) |
| GREG1 | 20,254,449 | 375,064 | 523,438 | 1.84% | 2,474 (-2,409; 16,679) |
| GREG1 MC | 20,254,449 | 349,100 | 505,158 | 1.84% | 269 (6,181; 7,326) |
| GREG2 | 20,281,603 | 367,900 | 537,802 | 1.97% | 2,039 (-626; 13,947) |
| GREG2 MC | 20,281,603 | 349,552 | 525,421 | 1.97% | 260 (6,215; 7,291) |
| LASSO | 20,064,671 | 347,586 | 389,309 | 0.88% | 323 (5,786; 7,168) |
As expected, the standard error of the NHIS estimate is the largest, as it
properly incorporates the complex survey design. If the calibration working
model correctly captures the relationship between the outcome variable and the
calibration variables, we anticipate the standard errors of the calibration
estimators to be smaller than that of the HTSRS estimator. This is not the case
for either of the GREG estimators, whose standard errors are larger than that
of HTSRS, although their RMSEs are smaller due to the reduction in bias. In
addition, the standard GREG estimator has a standard error about 2.0% greater
than the backward-selection GREG estimator, a feature offset by its estimated
6.6% reduction in bias (although this is insufficient to reduce RMSE); use of
the model-assisted GREG estimator does reduce the standard error and the root
mean square error, by 5-7% and 2-3% respectively, relative to the standard GREG
estimates. For LASSO calibration, we do observe a smaller standard error than
that of HTSRS, even with a bootstrap variance estimate that tends to
overestimate. Without using the correct design weights, LASSO calibration
produced the most accurate estimate of the population total while providing the
smallest standard error among the estimators in this application. This is in
spite of the fact that the LASSO calibration weights were only about
one-seventh as variable as the GREG weights, as measured by their standard
deviation, which is reflected in the smaller standard error of the estimator
itself and the greatly reduced RMSE.
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2018
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa