Model-assisted calibration of non-probability sample survey data using adaptive LASSO
Section 4. Simulation study
We design a simulation to evaluate the finite sample properties of the LASSO calibration estimator and of its two asymptotic closed-form variance estimators. We also consider a naive bootstrap variance estimator, obtained by drawing 500 samples with replacement from each simulation sample, as an alternative variance estimator of the LASSO calibration estimator. To simulate non-probability samples, we generate samples with unequal selection probabilities, but set the design weights to the same value for every sampled unit. We also consider GREG (the traditional calibration estimator) and HT (the pure design-based Horvitz-Thompson estimator). Because LASSO performs both variable selection and estimation, we implement backward stepwise selection to choose the working model for GREG. Although there is no theoretical justification for using stepwise variable selection, Skinner and Silva (1997) have shown that, given two auxiliary variables, a stepwise procedure can improve the efficiency of the GREG estimator.
We are interested in the performance of each estimator under (1) populations with different signal-to-noise ratios (SNR), (2) independent, informative, and biased sampling schemes, and (3) small and large sample sizes. The signal-to-noise ratio is calculated according to the definitions in Czanner, Sarma, Eden and Brown (2008). We set two levels of correlation (low/high) between covariates, crossed with two levels of effect size (low/high) of the covariates. We set the low/high and high/low populations to have the same SNR in order to understand the influence of correlation and effect size on each estimator's performance given the same SNR. Three sampling schemes are used to draw samples: simple random sampling without replacement, SRS; Poisson sampling with selection probabilities proportional to covariates, POI(X); and Poisson sampling with selection probabilities proportional to covariates and the outcome, POI(X+Y). POI(X+Y) sampling simulates the self-selection bias of non-probability samples, where the propensity of a respondent to participate in a study relates to the analysis variable. We consider two sample sizes: 250 and 1,000. Thus we have a total of $4 \times 3 \times 2 = 24$ experimental groups.
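The factorial layout described above (four population types, three sampling schemes, two sample sizes) can be enumerated in a few lines. This is only an illustrative sketch; the label strings follow the table headings, and the variable names are hypothetical.

```python
from itertools import product

# Population types are labelled correlation/effect size, as in the tables.
populations = ["low/low", "low/high", "high/low", "high/high"]
schemes = ["SRS", "POI(X)", "POI(X+Y)"]
sample_sizes = [250, 1000]

# Every combination of population, sampling scheme and sample size is one experimental group.
design_cells = list(product(populations, schemes, sample_sizes))

for population, scheme, n in design_cells[:3]:
    print(population, scheme, n)
print("number of experimental groups:", len(design_cells))
```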
4.1 Population
To create collinearity among covariates, we follow an auto-decay correlation structure commonly used in LASSO-related simulations (Tibshirani, 1996), in which the correlation between the $j$th and $k$th covariates is $\rho^{|j-k|}$. We generate a population of size 100,000 with 40 covariates drawn from a multivariate normal distribution with this covariance structure. The continuous outcome variable is generated by a linear regression model on the covariates, and the binary outcome variable is generated by a logistic regression model on the covariates. We set $\rho = 0.15$ for the low correlation population and $\rho = 0.73$ for the high correlation population. Some regression coefficients take the same values for both the continuous and the binary outcome variables, others are set separately for the continuous and the binary outcome, and the rest of the coefficients are set to zero. Out of 41 regression parameters, 16 are non-zero and 25 are zero.
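As a rough illustration of this population set-up, the sketch below draws covariates with correlation $\rho^{|j-k|}$ through an AR(1)-type recursion and then generates a continuous and a binary outcome from a sparse coefficient vector. Several choices are assumptions made only for the example and are not taken from the paper: zero means and unit variances for the covariates, an intercept of 1.0, non-zero slopes of 0.2, an error standard deviation of 1.0, and a reduced population size of 5,000 to keep the run short.

```python
import math
import random

def ar1_covariates(p, rho, rng):
    """One covariate vector with unit variances and corr(x_j, x_k) = rho**|j-k|."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(p - 1):
        # Each new covariate is correlated rho with the previous one.
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

def make_population(N, p, rho, beta0, beta, sigma, seed=0):
    """Covariates plus a continuous (linear model) and a binary (logistic model) outcome."""
    rng = random.Random(seed)
    X, y_cont, y_bin = [], [], []
    for _ in range(N):
        x = ar1_covariates(p, rho, rng)
        eta = beta0 + sum(b * v for b, v in zip(beta, x))
        X.append(x)
        y_cont.append(eta + rng.gauss(0.0, sigma))
        prob = 1.0 / (1.0 + math.exp(-eta))
        y_bin.append(1 if rng.random() < prob else 0)
    return X, y_cont, y_bin

p = 40
beta0 = 1.0                      # hypothetical intercept
beta = [0.2] * 15 + [0.0] * 25   # hypothetical slopes: 15 non-zero, 25 zero
X, y_cont, y_bin = make_population(N=5000, p=p, rho=0.73,
                                   beta0=beta0, beta=beta, sigma=1.0)
print(len(X), len(X[0]), sum(y_bin) / len(y_bin))
```

With the intercept counted, this hypothetical coefficient vector has 16 non-zero and 25 zero parameters, matching the counts stated above; the actual coefficient values used in the study are not reproduced by this sketch.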
4.2 Sampling schemes
Three sampling schemes are used to generate the sample:
- Simple-Random-Sampling (SRS): equal selection probabilities $n/N$ for all units.
- POI(X): Poisson sampling with probabilities proportional to the covariates.
- POI(X+Y): Poisson sampling with probabilities proportional to the covariates and the outcome.
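The three schemes can be sketched as follows on a toy population. The size measures used for the two Poisson schemes (labelled POI(X) and POI(X+Y) in the tables) are hypothetical stand-ins, since the exact functions of the covariates and the outcome are not given here; the toy population with a single covariate is likewise an assumption for illustration.

```python
import math
import random

def srs(N, n, rng):
    """Simple random sampling without replacement: every unit has probability n/N."""
    return rng.sample(range(N), n)

def poisson_sample(size, n, rng):
    """Poisson sampling: unit i is selected independently with probability
    proportional to size[i], scaled so the expected sample size is about n
    (probabilities are capped at 1)."""
    total = sum(size)
    probs = [min(1.0, n * s / total) for s in size]
    return [i for i, pr in enumerate(probs) if rng.random() < pr]

def equal_weight_mean(sample, values):
    """Sample mean that treats all sampled units alike, ignoring how they were selected."""
    return sum(values[i] for i in sample) / len(sample)

rng = random.Random(1)
N, n = 20000, 250
x = [rng.gauss(0.0, 1.0) for _ in range(N)]        # toy covariate
y = [xi + rng.gauss(0.0, 1.0) for xi in x]         # toy outcome related to x

size_x = [math.exp(0.5 * xi) for xi in x]                          # hypothetical size for POI(X)
size_xy = [math.exp(0.5 * xi + 0.5 * yi) for xi, yi in zip(x, y)]  # hypothetical size for POI(X+Y)

s_srs = srs(N, n, rng)
s_x = poisson_sample(size_x, n, rng)
s_xy = poisson_sample(size_xy, n, rng)

pop_mean_y = sum(y) / N
for name, s in [("SRS", s_srs), ("POI(X)", s_x), ("POI(X+Y)", s_xy)]:
    print(name, len(s), round(equal_weight_mean(s, y) - pop_mean_y, 2))
```

Under Poisson sampling the realized sample size is random, and an estimator that ignores the unequal selection probabilities drifts away from the population mean, which is the kind of selection bias the simulation is designed to induce.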
4.3 Evaluation metrics
We evaluate the empirical bias, variance, and RMSE of each estimator of the total. We evaluate the asymptotic variance estimates and the bootstrap variance estimates by the coverage of their nominal 95% confidence intervals and by their %bias relative to the empirical variance. We use the normal approximation to generate confidence intervals. We calculate %bias as the difference between the average variance estimate and the empirical variance, expressed as a percentage of the empirical variance, where the empirical variance is obtained from the simulation samples.
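A minimal sketch of these metrics is given below. It assumes that the empirical variance is the variance of the point estimates across simulation samples (computed here with the population formula so that RMSE squared equals bias squared plus variance), that %bias compares the average variance estimate with that empirical variance, and that intervals use the normal quantile 1.96. The function name and the toy numbers are hypothetical.

```python
import math
from statistics import mean, pvariance

def summarize(estimates, var_estimates, true_total, z=1.96):
    """Bias, empirical variance, RMSE, coverage (%) and %bias of the variance estimator."""
    bias = mean(estimates) - true_total
    emp_var = pvariance(estimates)            # variance across simulation samples
    rmse = math.sqrt(bias ** 2 + emp_var)
    # Normal-approximation interval: estimate +/- z * sqrt(variance estimate).
    covered = [abs(est - true_total) <= z * math.sqrt(v)
               for est, v in zip(estimates, var_estimates)]
    coverage = 100.0 * sum(covered) / len(covered)
    pct_bias = 100.0 * (mean(var_estimates) - emp_var) / emp_var
    return {"bias": bias, "var": emp_var, "rmse": rmse,
            "coverage": coverage, "pct_bias": pct_bias}

# Toy inputs: four simulated estimates of a total whose true value is 100.
result = summarize(estimates=[98.0, 102.0, 100.0, 104.0],
                   var_estimates=[4.0, 4.0, 4.0, 4.0],
                   true_total=100.0)
print(result)
```

A negative %bias means the variance estimator underestimates the empirical variance on average, and a positive value means it overestimates it.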
4.4 Simulation results
The simulation results are based on 1,000 simulated samples for each experimental group. Table 4.1 lists the bias, variance, and root mean square error of each estimator under the different experimental designs for estimating the total of a continuous outcome variable. Table 4.2 lists the corresponding results for estimating the total of a binary outcome variable.
4.4.1 Root mean square error
Under SRS, all estimators are unbiased, and LASSO and GREG perform approximately equally well relative to HT. POI(X) and POI(X+Y) induce biased samples by selecting cases with larger covariate values with higher probabilities. Under POI(X+Y), the selection also favors cases with larger outcome values. The absolute bias of LASSO decreases relative to GREG as SNR increases. This improvement is more dramatic in the binary case than in the continuous case, especially for POI(X+Y). In terms of RMSE, LASSO shows a marginal improvement over GREG for estimating totals of continuous outcome variables. The improvement is slightly more noticeable, about 3%, when there are highly correlated predictors in the model. For the binary setting, there is a substantial improvement in MSE for LASSO over GREG as SNR increases, with reductions of 20% for the POI(X) setting and nearly 50% for the POI(X+Y) setting when SNR is large. In particular, the Low/High and High/Low population types have the same SNR, so the difference in performance between LASSO and GREG can be attributed to correlation or effect size. LASSO performs better in both bias and RMSE in the High/Low population type, suggesting that LASSO has a stronger advantage over GREG when there are highly correlated predictors in the model. This suggests that LASSO has a better variable selection capability in the presence of multicollinearity than the stepwise variable selection procedure used for GREG.
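As a worked check of how the tabulated quantities relate, the sketch below recomputes RMSE from bias and variance for one row of Table 4.2 (binary outcome, high/high population, n = 1,000, POI(X+Y)) and expresses the LASSO gain over GREG as a percentage. The inputs are the rounded table entries, so the recomputed values agree with the tabulated RMSE only up to rounding; the helper names are hypothetical.

```python
import math

def rmse(bias, var):
    """Root mean square error from bias and variance."""
    return math.sqrt(bias ** 2 + var)

def pct_reduction(reference, candidate):
    """Percentage by which candidate is smaller than reference."""
    return 100.0 * (reference - candidate) / reference

# Table 4.2, high/high, n = 1,000, POI(X+Y): (bias, var) for GREG and LASSO.
greg_rmse = rmse(1.4, 1.2)
lasso_rmse = rmse(0.5, 0.7)

reduction = pct_reduction(greg_rmse, lasso_rmse)
print(round(greg_rmse, 2), round(lasso_rmse, 2), round(reduction, 1))
```

On the RMSE scale this row gives a reduction in the mid-40% range, in line with the "nearly 50%" quoted for the setting with the outcome-dependent selection.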
Table 4.1
Simulation summary for continuous outcome: bias, variance, and RMSE of estimators of the total
| Population | n | Sampling scheme | HT bias | HT var | HT rmse | GREG bias | GREG var | GREG rmse | LASSO bias | LASSO var | LASSO rmse |
|---|---|---|---|---|---|---|---|---|---|---|---|
| low/low (T = 100.8, SNR = 0.47) | 250 | SRS | 0.5 | 546 | 23.3 | 0.9 | 425 | 20.6 | 0.9 | 428 | 20.7 |
| | | POI(X) | 12.4 | 525 | 26.0 | -0.6 | 446 | 21.1 | -0.4 | 441 | 21.0 |
| | | POI(X+Y) | 19.4 | 519 | 29.9 | 4.6 | 443 | 21.5 | 4.7 | 431 | 21.3 |
| | 1,000 | SRS | 0.2 | 129 | 11.4 | 0.3 | 94 | 9.6 | 0.3 | 94 | 9.7 |
| | | POI(X) | 12.6 | 129 | 17.0 | -0.1 | 91 | 9.5 | -0.2 | 92 | 9.6 |
| | | POI(X+Y) | 19.7 | 128 | 22.7 | 4.9 | 91 | 10.7 | 5.0 | 91 | 10.7 |
| low/high (T = 101.4, SNR = 1.26) | 250 | SRS | 0.4 | 849 | 29.1 | 0.9 | 415 | 20.4 | 1.0 | 417 | 20.4 |
| | | POI(X) | 21.1 | 818 | 35.6 | -1.3 | 434 | 20.9 | -1.0 | 432 | 20.8 |
| | | POI(X+Y) | 31.7 | 817 | 42.7 | 3.7 | 427 | 21.0 | 4.0 | 427 | 21.1 |
| | 1,000 | SRS | 0.0 | 200 | 14.1 | 0.3 | 94 | 10.0 | 0.3 | 93 | 9.7 |
| | | POI(X) | 21.1 | 199 | 25.4 | -0.1 | 91 | 9.6 | -0.2 | 90 | 9.6 |
| | | POI(X+Y) | 31.7 | 196 | 34.6 | 4.9 | 91 | 10.7 | 4.8 | 89 | 10.6 |
| high/low (T = 101.8, SNR = 1.26) | 250 | SRS | 0.1 | 941 | 30.7 | 1.0 | 421 | 20.6 | 1.0 | 399 | 20.0 |
| | | POI(X) | 50.2 | 895 | 58.5 | -0.7 | 434 | 20.8 | -1.6 | 402 | 20.1 |
| | | POI(X+Y) | 57.8 | 872 | 64.9 | 4.0 | 435 | 21.2 | 3.0 | 399 | 20.2 |
| | 1,000 | SRS | 0.0 | 218 | 14.8 | 0.3 | 94 | 9.7 | 0.3 | 93 | 9.6 |
| | | POI(X) | 50.6 | 210 | 53.0 | -0.1 | 93 | 9.7 | -0.5 | 91 | 9.6 |
| | | POI(X+Y) | 58.2 | 209 | 59.9 | 4.7 | 95 | 10.8 | 4.2 | 92 | 10.5 |
| high/high (T = 103.1, SNR = 3.41) | 250 | SRS | -0.4 | 1,897 | 43.6 | 0.8 | 436 | 20.9 | 1.0 | 407 | 20.2 |
| | | POI(X) | 83.3 | 1,826 | 93.7 | -0.8 | 435 | 20.9 | -1.5 | 406 | 20.2 |
| | | POI(X+Y) | 96.4 | 1,779 | 105.3 | 3.7 | 428 | 21.0 | 3.0 | 404 | 20.3 |
| | 1,000 | SRS | -0.2 | 444 | 21.0 | 0.3 | 93 | 9.7 | 0.3 | 93 | 9.7 |
| | | POI(X) | 83.6 | 424 | 86.1 | -0.2 | 93 | 9.7 | -0.5 | 91 | 9.6 |
| | | POI(X+Y) | 96.9 | 423 | 99.0 | 4.4 | 94 | 10.6 | 4.1 | 92 | 10.4 |
Table 4.2
Simulation summary for binary outcome: bias, variance, and RMSE of estimators of the total
| Population | n | Sampling scheme | HT bias | HT var | HT rmse | GREG bias | GREG var | GREG rmse | LASSO bias | LASSO var | LASSO rmse |
|---|---|---|---|---|---|---|---|---|---|---|---|
| low/low (T = 56.2, SNR = 0.51) | 250 | SRS | 0.0 | 10.2 | 3.2 | 0.0 | 7.2 | 2.7 | 0.0 | 7.0 | 2.7 |
| | | POI(X) | 2.6 | 10.0 | 4.1 | 0.2 | 8.0 | 2.8 | 0.1 | 7.8 | 2.8 |
| | | POI(X+Y) | 4.9 | 9.8 | 5.8 | 2.0 | 8.1 | 3.5 | 1.8 | 7.8 | 3.3 |
| | 1,000 | SRS | -0.0 | 2.7 | 1.6 | 0.0 | 1.7 | 1.3 | 0.0 | 1.6 | 1.3 |
| | | POI(X) | 2.5 | 2.4 | 2.9 | 0.0 | 1.8 | 1.3 | -0.0 | 1.7 | 1.3 |
| | | POI(X+Y) | 4.7 | 2.3 | 5.0 | 1.8 | 1.8 | 2.2 | 1.6 | 1.7 | 2.1 |
| low/high (T = 54.4, SNR = 1.10) | 250 | SRS | -0.0 | 10.8 | 3.3 | 0.0 | 6.1 | 2.5 | 0.1 | 5.4 | 2.3 |
| | | POI(X) | 3.0 | 10.2 | 4.4 | 0.1 | 6.1 | 2.5 | 0.1 | 5.8 | 2.4 |
| | | POI(X+Y) | 5.3 | 9.8 | 6.2 | 1.6 | 6.2 | 2.9 | 1.3 | 5.8 | 2.8 |
| | 1,000 | SRS | -0.0 | 2.7 | 1.6 | 0.0 | 1.3 | 1.1 | 0.0 | 1.1 | 1.0 |
| | | POI(X) | 2.9 | 2.4 | 3.3 | 0.0 | 1.4 | 1.2 | -0.1 | 1.2 | 1.1 |
| | | POI(X+Y) | 5.2 | 2.2 | 5.4 | 1.4 | 1.4 | 1.8 | 1.1 | 1.2 | 1.6 |
| high/low (T = 54.2, SNR = 1.10) | 250 | SRS | -0.0 | 10.3 | 3.2 | 0.0 | 5.8 | 2.4 | 0.1 | 4.9 | 2.2 |
| | | POI(X) | 6.6 | 9.6 | 7.3 | 0.3 | 6.2 | 2.5 | -0.2 | 4.8 | 2.2 |
| | | POI(X+Y) | 8.6 | 9.3 | 9.1 | 1.8 | 6.3 | 3.1 | 0.9 | 4.9 | 2.4 |
| | 1,000 | SRS | -0.0 | 2.5 | 1.6 | 0.0 | 1.2 | 1.1 | 0.0 | 1.0 | 1.0 |
| | | POI(X) | 6.6 | 2.2 | 6.7 | 0.2 | 1.4 | 1.2 | -0.2 | 1.1 | 1.1 |
| | | POI(X+Y) | 8.5 | 2.1 | 8.7 | 1.6 | 1.4 | 2.0 | 1.0 | 1.0 | 1.4 |
| high/high (T = 52.8, SNR = 2.75) | 250 | SRS | -0.1 | 10.2 | 3.1 | -0.0 | 5.2 | 2.3 | 0.1 | 3.8 | 1.9 |
| | | POI(X) | 7.1 | 9.8 | 7.8 | 0.3 | 5.7 | 2.4 | -0.2 | 3.6 | 1.9 |
| | | POI(X+Y) | 9.1 | 9.4 | 9.6 | 1.5 | 5.7 | 2.8 | 0.5 | 3.7 | 2.0 |
| | 1,000 | SRS | -0.1 | 2.5 | 1.6 | -0.0 | 1.1 | 1.0 | 0.0 | 0.6 | 0.8 |
| | | POI(X) | 7.1 | 2.2 | 7.2 | 0.2 | 1.3 | 1.1 | -0.2 | 0.7 | 0.9 |
| | | POI(X+Y) | 9.1 | 2.2 | 9.2 | 1.4 | 1.2 | 1.8 | 0.5 | 0.7 | 1.0 |
4.4.2 LASSO variance estimates
Tables 4.3 and 4.4 list the 95% nominal coverage and percent bias for each of the two asymptotic closed-form variance estimators developed in this research, as well as for the naive bootstrap variance estimate of the LASSO calibration estimator.
For continuous outcomes, bootstrap variances have coverages that are consistently close to 95% under the SRS and POI(X) sampling schemes for both sample sizes. Under the POI(X+Y) sampling scheme, there is very modest undercoverage in Table 4.3. The closed-form variances have coverages that are sensitive to both sample size and sampling scheme, with smaller samples tending to undercover, particularly for the POI(X+Y) sampling scheme. The difference in coverage between small and large sample sizes is expected, since the variance estimates are asymptotic and improve with larger samples. In terms of the bias of the variance estimators, there is evidence that bias decreases as SNR increases. With the same SNR, both the asymptotic closed-form and the bootstrap variances have smaller bias given predictors with high correlations than given predictors with high effect sizes. Closed-form variances tend to underestimate the empirical variance, especially when the sample size is small. Overall, there is very little difference between the two closed-form variance estimates. The bootstrap variance tends to overestimate the empirical variance, but its absolute bias is generally smaller than that of the closed-form variance estimates.
For binary outcomes, both the asymptotic closed-form and the bootstrap variance estimates are sensitive to sample size, sampling scheme, and SNR. Bootstrap variance coverages are consistently close to 95% under SRS and POI(X) for both sample sizes and all population types, but coverages range from 75% to 94% under POI(X+Y). Under POI(X+Y), the bootstrap variance coverages are better with sample size 250 than with sample size 1,000, where the bias becomes a larger part of the RMSE, and better with high-correlation populations than with low-correlation populations. In terms of coverage, closed-form variances show a similar trend under POI(X+Y) as the bootstrap: better coverage with smaller samples than with bigger samples, and better coverage with high-correlation populations than with low-correlation populations. Under SRS and POI(X), closed-form variance coverage improves as sample size increases. In terms of bias, both bootstrap and closed-form variances have smaller bias with larger sample sizes. Holding sample size fixed, closed-form variance estimates have larger bias as SNR increases. The same trend is not observed in the bootstrap variance estimates. As with the continuous outcome results, the closed-form variance tends to underestimate the empirical variance, especially when the sample size is small. Unlike the continuous outcome results, there is evidence that the weighted closed-form variance estimates have better bias properties than the unweighted closed-form variance estimates. The bootstrap variance tends to overestimate the empirical variance. However, the biases are much smaller than for the closed-form variance estimates.
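The naive bootstrap variance estimate referred to above can be sketched as follows: resample the observed units with replacement, recompute the estimator on each resample, and take the variance of the replicates. In the study the estimator would be the LASSO calibration estimator, presumably re-fitted on every resample; here a plain sample mean stands in for it so that the example stays self-contained, and the function and variable names are hypothetical.

```python
import random
from statistics import variance

def naive_bootstrap_variance(sample, estimator, n_boot=500, seed=0):
    """Variance of the estimator over n_boot with-replacement resamples of the sample."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        # Draw n units with replacement from the observed sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return variance(replicates)

def sample_mean(values):
    """Stand-in estimator; the paper's estimator is the LASSO calibration estimator."""
    return sum(values) / len(values)

data_rng = random.Random(42)
ys = [data_rng.gauss(10.0, 2.0) for _ in range(250)]   # toy sample of size 250

v_boot = naive_bootstrap_variance(ys, sample_mean, n_boot=500)
v_textbook = variance(ys) / len(ys)                    # usual variance of a sample mean
print(v_boot, v_textbook)
```

For the sample mean the two numbers should be close, which is a quick sanity check on the resampling loop before swapping in a more complex estimator.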
Table 4.3
95% nominal coverage and %bias of variance estimates for LASSO (continuous outcome)
| Population | n | Scheme | Coverage: closed-form (1) | Coverage: closed-form (2) | Coverage: bootstrap | %bias: closed-form (1) | %bias: closed-form (2) | %bias: bootstrap |
|---|---|---|---|---|---|---|---|---|
| low/low | 250 | SRS | 91.7% | 91.8% | 95.4% | -22.6% | -22.3% | 2.9% |
| | | POI(X) | 91.2% | 91.2% | 96.1% | -25.1% | -24.5% | 5.7% |
| | | POI(X+Y) | 89.6% | 89.9% | 95.4% | -23.5% | -22.8% | 7.9% |
| | 1,000 | SRS | 93.2% | 93.2% | 93.8% | -7.3% | -7.2% | -0.3% |
| | | POI(X) | 94.0% | 93.9% | 95.5% | -5.7% | -5.3% | 6.6% |
| | | POI(X+Y) | 90.0% | 90.1% | 92.1% | -4.9% | -4.4% | 7.9% |
| low/high | 250 | SRS | 91.5% | 91.5% | 95.7% | -22.6% | -22.3% | 6.2% |
| | | POI(X) | 90.9% | 91.2% | 96.4% | -25.4% | -24.9% | 8.8% |
| | | POI(X+Y) | 90.0% | 90.2% | 95.1% | -24.5% | -23.7% | 9.9% |
| | 1,000 | SRS | 93.4% | 93.5% | 94.3% | -6.6% | -6.5% | -0.1% |
| | | POI(X) | 94.1% | 94.2% | 95.9% | -4.0% | -3.5% | 7.6% |
| | | POI(X+Y) | 90.7% | 90.7% | 92.7% | -2.9% | -2.3% | 9.6% |
| high/low | 250 | SRS | 92.3% | 92.2% | 95.4% | -17.4% | -17.1% | 2.0% |
| | | POI(X) | 92.5% | 92.6% | 95.8% | -17.9% | -16.1% | 6.4% |
| | | POI(X+Y) | 91.2% | 91.8% | 96.5% | -17.4% | -15.4% | 7.1% |
| | 1,000 | SRS | 93.5% | 93.5% | 94.4% | -6.5% | -6.4% | -0.9% |
| | | POI(X) | 94.1% | 94.0% | 95.4% | -5.0% | -3.1% | 5.7% |
| | | POI(X+Y) | 91.9% | 92.3% | 93.4% | -6.0% | -3.9% | 5.0% |
| high/high | 250 | SRS | 92.3% | 92.3% | 95.2% | -19.6% | -19.3% | 2.2% |
| | | POI(X) | 92.0% | 92.3% | 96.1% | -19.6% | -17.8% | 7.4% |
| | | POI(X+Y) | 91.2% | 91.8% | 95.6% | -19.1% | -16.9% | 8.3% |
| | 1,000 | SRS | 93.4% | 93.4% | 94.5% | -6.5% | -6.4% | -0.7% |
| | | POI(X) | 94.0% | 94.5% | 95.6% | -4.7% | -2.8% | 6.7% |
| | | POI(X+Y) | 92.2% | 92.4% | 93.4% | -5.6% | -3.3% | 6.1% |
Table 4.4
95% nominal coverage and %bias of variance estimates for LASSO (binary outcome)
| Population | n | Scheme | Coverage: closed-form (1) | Coverage: closed-form (2) | Coverage: bootstrap | %bias: closed-form (1) | %bias: closed-form (2) | %bias: bootstrap |
|---|---|---|---|---|---|---|---|---|
| low/low | 250 | SRS | 89.8% | 90.0% | 95.9% | -28.1% | -27.8% | 9.2% |
| | | POI(X) | 88.1% | 88.6% | 96.7% | -37.3% | -35.3% | 9.2% |
| | | POI(X+Y) | 79.0% | 79.9% | 91.2% | -38.7% | -35.9% | 8.0% |
| | 1,000 | SRS | 92.8% | 92.8% | 93.5% | -11.9% | -11.8% | -3.5% |
| | | POI(X) | 92.0% | 92.8% | 95.7% | -17.9% | -15.5% | 1.0% |
| | | POI(X+Y) | 68.6% | 69.6% | 74.6% | -18.5% | -14.9% | 0.5% |
| low/high | 250 | SRS | 86.8% | 87.0% | 94.9% | -37.7% | -37.3% | 11.3% |
| | | POI(X) | 85.4% | 86.1% | 95.5% | -42.9% | -41.2% | 14.4% |
| | | POI(X+Y) | 78.7% | 80.1% | 92.6% | -44.0% | -41.3% | 14.4% |
| | 1,000 | SRS | 94.4% | 94.3% | 95.2% | -5.5% | -5.4% | 5.8% |
| | | POI(X) | 91.8% | 92.1% | 94.9% | -20.5% | -18.6% | -1.8% |
| | | POI(X+Y) | 76.8% | 77.8% | 82.9% | -20.4% | -16.9% | -1.3% |
| high/low | 250 | SRS | 89.2% | 89.1% | 94.4% | -28.5% | -28.1% | 0.4% |
| | | POI(X) | 89.0% | 90.1% | 95.5% | -31.9% | -25.3% | 12.7% |
| | | POI(X+Y) | 85.7% | 88.4% | 93.8% | -33.9% | -25.4% | 10.9% |
| | 1,000 | SRS | 93.9% | 93.9% | 95.6% | -6.3% | -6.2% | 3.5% |
| | | POI(X) | 92.6% | 93.4% | 94.8% | -16.5% | -9.2% | 1.9% |
| | | POI(X+Y) | 83.3% | 85.4% | 88.1% | -15.0% | -5.0% | 5.2% |
| high/high | 250 | SRS | 82.8% | 82.8% | 93.8% | -44.6% | -44.3% | -6.4% |
| | | POI(X) | 83.6% | 85.5% | 95.1% | -44.3% | -39.4% | 3.8% |
| | | POI(X+Y) | 82.9% | 85.1% | 93.8% | -45.1% | -38.4% | 4.6% |
| | 1,000 | SRS | 94.3% | 94.4% | 96.1% | -7.8% | -7.6% | 6.3% |
| | | POI(X) | 91.3% | 92.2% | 94.0% | -20.0% | -13.8% | 0.2% |
| | | POI(X+Y) | 86.3% | 88.6% | 91.5% | -18.1% | -9.2% | 2.8% |
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2018
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa