Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 4. Results of the simulation study
4.1 Performance of estimators in terms of design MSE
We computed design bias and design mean square error
(MSE) from the 5,000 total estimates by sample size and number of marginal
categories. The percentage absolute relative design bias was less than 2
percent for all the estimators for all scenarios. As expected, for all
estimators, the bias decreases as the sample size increases.
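As an illustration only, the Monte Carlo summaries described above could be computed along the following lines. This is a minimal sketch rather than the code used in the study: the function name `design_summaries` and the toy inputs are hypothetical, and the percentage absolute relative bias is assumed to be 100 times the absolute difference between the mean estimate and the true total, divided by the true total.

```python
import numpy as np

def design_summaries(estimates, true_total):
    """Monte Carlo design bias and MSE from repeated-sample estimates of a total."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_total
    # Assumed definition: 100 * |bias| / |true total|
    pct_abs_rel_bias = 100.0 * abs(bias) / abs(true_total)
    # Design MSE: average squared deviation from the true population total
    mse = np.mean((estimates - true_total) ** 2)
    return {"bias": bias, "pct_abs_rel_bias": pct_abs_rel_bias, "mse": mse}

# Toy input (hypothetical): four estimates of a total whose true value is 100
summ = design_summaries([98.0, 102.0, 101.0, 99.0], true_total=100.0)
print(summ)
```

In a simulation such as this one, `estimates` would hold the 5,000 estimated totals for one estimator under one combination of sample size and number of marginal categories.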
Figure 4.1 displays the MSE of the HT, GREG, GREG
with forward variable selection (FSTEP), regression tree and calibrated lasso
estimators by sample size, based on the 5,000 simulated samples. The MSE values
are similar for the adaptive and non-calibrated versions of the lasso
estimators. For all the estimators, the decrease in MSE is much more pronounced
from a sample size of 200 to 500 than from 500 to 1,000. This is likely due to
the small sample size relative to the number of categories for the auxiliary
variables: with only 200 sampled units, it may not be possible to explore all
the potential effects, particularly higher order effects.
Table 4.1 displays the ratio of the design MSE of each
estimator to the MSE of the HT estimator for the total amount of trade credit
requested. For a sample size of 200, the regression tree estimator and the lasso
(2-way) estimator with two-factor interaction effects are the only
model-assisted estimators that provide any efficiency gains, relative to the HT
estimator, when the number of categories of auxiliary variables used is large. As
the sample size increases, the gains in efficiency of the model-assisted survey
regression estimators, relative to the HT estimator, become essentially equal.
Using any of the model-assisted estimators when the sample size is 1,000 results
in a slight gain in efficiency, relative to the HT estimator. Overall, there is
little efficiency advantage for model-assisted estimators over the HT estimator,
indicating that the auxiliary variables are not strongly related to the variable
of interest.
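To make the MSE ratio concrete, the sketch below runs a small Monte Carlo comparison of the HT estimator with a one-variable regression (GREG-type) estimator under simple random sampling. Everything in it is a hypothetical toy setup (the population, the sample size of 50, the 500 replicates and the strong linear relationship), not the survey data or design used in this study; it only illustrates how a ratio below 1 arises when the auxiliary variable is predictive.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical toy population (not the survey data used in the paper)
N, n, reps = 1000, 50, 500
x = rng.uniform(0.0, 1.0, size=N)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=N)
t_x, t_y = x.sum(), y.sum()

ht = np.empty(reps)
greg = np.empty(reps)
for r in range(reps):
    s = rng.choice(N, size=n, replace=False)  # simple random sample without replacement
    d = N / n                                 # design weight 1/pi_i under this design
    ht_y = d * y[s].sum()                     # HT estimate of the total of y
    ht_x = d * x[s].sum()                     # HT estimate of the total of x
    slope = np.polyfit(x[s], y[s], 1)[0]      # OLS slope (equal weights in this design)
    ht[r] = ht_y
    greg[r] = ht_y + slope * (t_x - ht_x)     # regression adjustment toward the known t_x

mse_ht = np.mean((ht - t_y) ** 2)
mse_greg = np.mean((greg - t_y) ** 2)
mse_ratio = mse_greg / mse_ht
print(f"MSE ratio (GREG / HT): {mse_ratio:.3f}")
```

A ratio below 1 indicates an efficiency gain over the HT estimator; when the auxiliary variables are only weakly related to the variable of interest, as in Table 4.1, the ratio stays close to 1.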
Description of Figure 4.1
Figure presenting the comparison of mean square error (MSE) for the HT (red), GREG (orange), FSTEP (green), regression tree (blue) and calibrated lasso estimators (1-way in purple and 2-way in turquoise) for the total amount of trade credit requested, by sample size (200, 500 and 1,000). For all the estimators, the decrease in MSE is much more pronounced from a sample size of 200 to 500 than from 500 to 1,000. For a sample size of 200, the regression tree estimator is the only estimator that provides efficiency gains, relative to the HT estimator. For sample sizes of 500 and 1,000, all the estimators provide efficiency gains, relative to the HT estimator.
Table 4.1
Ratio of MSE of each estimator to MSE of HT estimator with 20 and 28 marginal categories
| Estimator | 20 categories (sample size 200) | 20 categories (sample size 500) | 20 categories (sample size 1,000) | 28 categories (sample size 200) | 28 categories (sample size 500) | 28 categories (sample size 1,000) |
|---|---|---|---|---|---|---|
| GREG | 1.067 | 1.011 | 0.994 | 1.084 | 0.959 | 0.954 |
| FSTEP | 1.036 | 1.009 | 0.994 | 1.040 | 0.945 | 0.958 |
| TREE | 1.023 | 1.007 | 0.977 | 0.983 | 0.963 | 0.949 |
| LASSO (1-way) | 1.020 | 0.995 | 0.986 | 1.009 | 0.946 | 0.947 |
| CLASSO (1-way) | 1.047 | 1.004 | 0.990 | 1.042 | 0.952 | 0.949 |
| LASSO (2-way) | 0.999 | 0.995 | 0.952 | 0.981 | 0.935 | 0.936 |
| CLASSO (2-way) | 1.061 | 1.029 | 0.966 | 1.045 | 0.959 | 0.950 |
| ALASSO | 1.024 | 0.999 | 0.986 | 1.021 | 0.948 | 0.948 |
| CALASSO | 1.040 | 1.005 | 0.989 | 1.037 | 0.951 | 0.949 |
The potential gains in efficiency for model-assisted
estimators depend on the predictive power of the working model. In our
simulation population, the strength of the relationship between the variable of
interest and the available auxiliary variables is weak, leading to only slight
efficiency gains relative to the purely design-based HT estimator. Therefore,
to further explore the differences between the various model-assisted survey
estimators, we ran additional simulations using different survey variables of
interest, generated according to the following procedure:
1. Assuming a lasso model with main effects only, we obtained the lasso
coefficient estimates for the amount of trade credit requested, using the
population values for the auxiliary variables, including revenue.
2. We used the coefficient estimates obtained in step 1 and the population
values for the auxiliary variables to generate a new survey variable of
interest, equal to the model prediction plus an error term, where the error
term is a normally distributed random variable with mean 0 and standard
deviation chosen such that the adjusted coefficient of determination
approximately equals a fixed target value.
3. We drew 5,000 repeated samples from the target population and calculated
the mean square error of each estimator of the total of the new survey
variable.
4. Steps 1-3 were repeated by fitting a lasso regression model with main
effects and 2-way interactions and a regression tree model using the algorithm
detailed in Section 2.5.
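The generation step in the procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `generate_survey_variable`, the indicator matrix `X`, the coefficients `beta` and the target value of 0.5 are all hypothetical placeholders. The sketch also targets the ordinary coefficient of determination; for a large population and a modest number of coefficients, the adjusted version differs only slightly.

```python
import numpy as np

def generate_survey_variable(X, beta, target_r2, rng):
    """Generate y = X @ beta + eps with the noise scaled to a target R^2."""
    signal = X @ beta
    # R^2 = var(signal) / (var(signal) + sigma^2)
    #   =>  sigma^2 = var(signal) * (1 - R^2) / R^2
    sigma = np.sqrt(signal.var() * (1.0 - target_r2) / target_r2)
    eps = rng.normal(0.0, sigma, size=X.shape[0])
    return signal + eps, sigma

rng = np.random.default_rng(2024)
X = rng.integers(0, 2, size=(5000, 4)).astype(float)  # hypothetical indicator variables
beta = np.array([1.0, -2.0, 0.5, 3.0])                # placeholder coefficient estimates
y_new, sigma = generate_survey_variable(X, beta, target_r2=0.5, rng=rng)

# Realized R^2 of the generating model on the simulated population
signal = X @ beta
r2 = 1.0 - np.sum((y_new - signal) ** 2) / np.sum((y_new - y_new.mean()) ** 2)
print(f"noise sd = {sigma:.3f}, realized R^2 = {r2:.3f}")
```

For a regression tree generating model, `X @ beta` would be replaced by the tree's predicted values; the noise scaling is unchanged.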
Table 4.2 displays the ratio of
the design MSE of each estimator to that of the HT estimator under the three
different models generating the survey variable of interest, for a single sample
size. As expected, the estimator based on the
correctly specified working model is the most efficient. In the case where the
true generating model contains only main effects, assuming a working model with
higher order interactions results in a slight loss in efficiency. If two-way or
higher order interactions are present, the regression tree and lasso-based
estimators fitted with two-way interactions are more efficient than the
model-assisted estimators based on working models with only main effects. When
the generating model is a regression tree, the regression tree estimator yields
modest efficiency gains over the 2-way lasso-based estimators. This can be
explained by the fact that the regression tree model groups the categories of
an auxiliary variable based on their relationship to the variable of interest
and, therefore, reduces the model size. In all cases, substantial efficiency
gains, relative to the design-based HT estimator, are achieved.
Table 4.2
Ratio of MSE for each estimator to MSE of HT under different models generating survey variable of interest
| Estimator | Generating model: LASSO (1-way) | Generating model: LASSO (2-way) | Generating model: Regression Tree |
|---|---|---|---|
| GREG | 0.749 | 0.855 | 0.878 |
| FSTEP | 0.749 | 0.855 | 0.876 |
| TREE | 0.803 | 0.821 | 0.778 |
| LASSO (1-way) | 0.747 | 0.850 | 0.871 |
| CLASSO (1-way) | 0.747 | 0.851 | 0.873 |
| LASSO (2-way) | 0.763 | 0.761 | 0.826 |
| CLASSO (2-way) | 0.763 | 0.765 | 0.833 |
| ALASSO | 0.750 | 0.849 | 0.872 |
| CALASSO | 0.750 | 0.851 | 0.873 |
4.2 Performance under other scenarios
We also examined the performance of the lasso-based and
regression tree estimators under scenarios where there are no main effects,
only 2-way interactions. We generated a fourth survey variable of interest
using the lasso regression model with main effects and 2-way interactions as
described in the procedure above. However, in step 2, we set all coefficient
estimates corresponding to main effects equal to 0.
The first column of Table 4.3 (labelled No
Multicollinearity) shows the ratio of the design MSE of the estimators to that
of the HT estimator, where the survey variable is generated from a model with no
main effects. Under this scenario, the lasso estimators with
2-way interactions and the regression tree estimator are substantially more
efficient than model-assisted estimators based on main-effects-only models.
Relative to the commonly used GREG estimator, the efficiency gains for the
lasso estimators with 2-way interactions and the regression tree estimator are
substantially greater when there are no main effects. This is evident from
comparing the LASSO (2-way) column in Table 4.2 with the first column in
Table 4.3. The relative MSE is very similar for the 2-way lasso and regression
tree estimators but closer to 1 for the GREG and 1-way lasso estimators.
Table 4.3
Ratio of MSE for each estimator to MSE of HT under generating model with no main effects and in the absence/presence of multicollinearity
| Estimator | No Multicollinearity | Duplicated Variable | Collapsed Categories |
|---|---|---|---|
| GREG | 0.935 | - | - |
| TREE | 0.824 | 0.850 | 0.842 |
| LASSO (1-way) | 0.930 | 0.945 | 0.942 |
| CLASSO (1-way) | 0.936 | 0.953 | 0.951 |
| LASSO (2-way) | 0.783 | 0.795 | 0.773 |
| CLASSO (2-way) | 0.795 | 0.809 | 0.781 |
For administrative data with many variables, it is not
uncommon for some variables to be collinear or nearly collinear. For example,
information on both the total number of employees and the number of full-time
equivalent employees is often available. The GREG estimator, and by extension
the FSTEP and adaptive lasso estimators, fail in the presence of
collinearity because the design matrix is singular. We investigated the
performance of the regression tree and lasso estimators in the presence of
multicollinearity. We considered two types of multicollinearity:
- Duplicate of existing categorical variable: We created three new indicator
variables corresponding to employment size.
- Collapsed categories of existing auxiliary variable: We created a new
indicator variable corresponding to the three highest categories of revenue.
The MSE, relative to the HT estimator, is shown in columns 2 and 3 of Table 4.3.
These results are very similar to those in the first column of Table 4.3,
obtained without multicollinearity. The regression tree and lasso estimators
provide an automatic way of removing collinear auxiliary variables without
impacting the potential efficiency gains. It should be noted that other
methods, such as principal component analysis, can be used to eliminate
collinearity but require some expertise.
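The reason least-squares-based estimators break down under the first type of multicollinearity can be seen in a small numerical sketch. The data below are hypothetical (a three-category size variable and one continuous variable) and are not the study's auxiliary variables; the point is only that duplicating indicator columns adds columns without adding rank, so the cross-product matrix needed for the regression coefficients cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Hypothetical 3-category variable coded as one indicator column per category
size_class = rng.integers(0, 3, size=n)
indicators = np.eye(3)[size_class]

# Design matrix: three indicators plus one continuous auxiliary variable
X = np.column_stack([indicators, rng.normal(size=n)])

# Duplicate the three indicator columns, as in the "duplicated variable" scenario
X_dup = np.column_stack([X, indicators])

rank_X = np.linalg.matrix_rank(X)
rank_dup = np.linalg.matrix_rank(X_dup)
print(f"columns: {X.shape[1]}, rank: {rank_X}")
print(f"columns with duplicates: {X_dup.shape[1]}, rank: {rank_dup}")
```

Because the rank stays at 4 while the number of columns grows to 7, `X_dup.T @ X_dup` is singular. Penalized or tree-based fits do not require this inverse, which is consistent with the results in columns 2 and 3 of Table 4.3.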
4.3 Performance of variance estimators in terms of relative bias
Variance estimators based on (2.8) were constructed for each
estimator. Table 4.4 displays the percentage relative bias of each variance
estimator for the total amount of trade credit requested. For comparison
purposes, the theoretically unbiased variance estimator of the HT estimator is
included in this table. This variance estimator is a special case of the
expression provided in (2.8). The variance estimators for the model-assisted
survey regression estimators have
substantial negative bias, which increases as the number of auxiliary variables
increases. The magnitude of negative bias is
largest for the lasso-based estimators fitted using 2-way interactions. For
small sample sizes, the negative bias is smallest for the regression tree
estimator. As well, for small sample sizes, there is a substantial difference
in bias between the GREG and FSTEP estimators. Performing variable selection
prior to calculating the standard
GREG calibration estimator appears to reduce the bias of the variance
estimator in this case. The bias decreases for all model-assisted survey
regression estimators as the sample size increases.
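A percentage relative bias of this kind could be computed from the simulation output roughly as follows. The sketch is hypothetical in two respects: the function name `pct_relative_bias_of_variance` is illustrative, and the benchmark is assumed to be the Monte Carlo MSE of the point estimator (the Monte Carlo variance is another common choice, and the two are close when the design bias is small).

```python
import numpy as np

def pct_relative_bias_of_variance(var_estimates, point_estimates, true_total):
    """Percent relative bias of a variance estimator against the Monte Carlo MSE."""
    var_estimates = np.asarray(var_estimates, dtype=float)
    point_estimates = np.asarray(point_estimates, dtype=float)
    # Assumed benchmark: Monte Carlo MSE of the point estimator
    mc_mse = np.mean((point_estimates - true_total) ** 2)
    # Negative values mean the variance estimator underestimates on average
    return 100.0 * (var_estimates.mean() - mc_mse) / mc_mse

# Toy input (hypothetical): four replicates with a true total of 100
rb = pct_relative_bias_of_variance(
    var_estimates=[2.0, 2.5, 2.0, 2.5],
    point_estimates=[98.0, 102.0, 101.0, 99.0],
    true_total=100.0,
)
print(f"percent relative bias: {rb:.2f}")
```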
Table 4.4
Percent relative bias of variance estimators
| Estimator | 20 categories (sample size 200) | 20 categories (sample size 500) | 20 categories (sample size 1,000) | 28 categories (sample size 200) | 28 categories (sample size 500) | 28 categories (sample size 1,000) |
|---|---|---|---|---|---|---|
| GREG | -12.44 | -4.16 | -1.60 | -22.23 | -10.86 | -6.99 |
| FSTEP | -7.05 | -3.60 | -1.62 | -14.07 | -7.71 | -6.73 |
| TREE | -5.79 | -5.53 | -2.81 | -8.45 | -12.93 | -10.83 |
| LASSO (1-way) | -7.79 | -2.96 | -1.14 | -12.42 | -9.49 | -6.44 |
| CLASSO (1-way) | -10.08 | -3.74 | -1.61 | -16.01 | -9.84 | -6.52 |
| LASSO (2-way) | -11.94 | -11.57 | -7.62 | -16.12 | -15.14 | -13.08 |
| CLASSO (2-way) | -19.99 | -15.09 | -9.06 | -25.87 | -19.04 | -15.14 |
| ALASSO | -8.69 | -3.61 | -1.41 | -14.52 | -9.43 | -6.38 |
| CALASSO | -9.40 | -3.78 | -1.48 | -15.80 | -9.64 | -6.46 |
| HT | 5.19 | 5.72 | 5.82 | 4.90 | -0.11 | 1.66 |
Given the bias of the variance estimators seen here,
particularly for small sample sizes, a possible concern is the quality of the
first-order Taylor expansion approximation. For a large number of categorical
auxiliary variables, the remainder term in the Taylor expansion may no longer
be negligible for small sample sizes. An alternative variance estimator for the
lasso estimators was considered by McConville et al. (2017) but yielded
only slight improvements in terms of bias reduction. An additional concern is
properly accounting for the inherently data-driven procedure used to estimate
the regression tree and lasso models. The regression tree model has splits,
while the lasso models have a penalty parameter, both of which depend on the
sample.
4.4 Properties of the survey weights
Regression weights are directly available for the GREG,
FSTEP, regression tree, lasso calibration (1-way and 2-way) and adaptive lasso
calibration estimators. We investigated the properties of the weights for these
estimators in our simulations.
Large variation in the values of the weights is undesirable
as it allows some units to be much more influential than others. Positive
weights are preferred by national statistical organizations, as a negative
weight no longer has the interpretation of the number of population
units represented by the sampled unit.
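For readers who want to see how regression weights arise and why they can turn negative, the sketch below uses the standard linear calibration form of the GREG weights, w_i = d_i (1 + x_i' T^{-1} (t_x - t_x_HT)), with T the design-weighted cross-product matrix of the auxiliary variables. This is offered as an assumption-laden illustration: the exact weight expressions for each estimator are those defined earlier in the paper and may differ, and the function name `greg_weights` and the toy population are hypothetical.

```python
import numpy as np

def greg_weights(d, X, t_x):
    """Linear GREG (calibration) weights for design weights d and known totals t_x."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    T = X.T @ (d[:, None] * X)        # sum over the sample of d_i * x_i x_i'
    t_x_ht = X.T @ d                  # HT estimates of the auxiliary totals
    lam = np.linalg.solve(T, np.asarray(t_x, dtype=float) - t_x_ht)
    return d * (1.0 + X @ lam)        # the adjustment factor may fall below zero

# Hypothetical toy population: intercept plus one skewed auxiliary variable
rng = np.random.default_rng(1)
N, n = 1000, 50
X_pop = np.column_stack([np.ones(N), rng.lognormal(size=N)])
s = rng.choice(N, size=n, replace=False)   # simple random sample without replacement
d = np.full(n, N / n)                      # design weights
w = greg_weights(d, X_pop[s], X_pop.sum(axis=0))

print("calibrated totals:", X_pop[s].T @ w)
print("known totals:     ", X_pop.sum(axis=0))
print("any negative weight:", bool((w < 0).any()))
```

The weights reproduce the known auxiliary totals exactly, but nothing in this linear form constrains them to be positive, which is the concern raised above.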
First, we computed the average, over the
repeated samples, of the empirical within-sample variance of the weights, where
the variance for a given simulated sample is taken over the weights of the
units in that sample. We also computed the average coefficient of variation
(CV) of the weights, obtained in the same way from the within-sample CV of
each simulated sample.
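The two weight summaries can be sketched as below. The function name `weight_variability` and the toy weights are hypothetical, and the within-sample variance is assumed to use the n - 1 divisor; the source formulas may use a different divisor.

```python
import numpy as np

def weight_variability(weights_by_sample):
    """Average within-sample variance and average CV of weights across samples."""
    variances, cvs = [], []
    for w in weights_by_sample:
        w = np.asarray(w, dtype=float)
        var = w.var(ddof=1)                   # within-sample variance (n - 1 divisor assumed)
        variances.append(var)
        cvs.append(np.sqrt(var) / w.mean())   # within-sample coefficient of variation
    return float(np.mean(variances)), float(np.mean(cvs))

# Toy input (hypothetical): weights from two simulated samples of four units each
avg_var, avg_cv = weight_variability([[10, 10, 10, 10], [5, 10, 15, 10]])
print(f"average variance = {avg_var:.3f}, average CV = {avg_cv:.3f}")
```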
Table 4.5 displays the average
variance and average CV for the weights across samples when revenue was
included as an auxiliary variable. The weights for the GREG estimator and, to a
lesser extent, the FSTEP estimator are much more variable than the weights for the
regression tree and lasso-based estimators, particularly for small sample
sizes. The variability of the weights for the three lasso-based approaches is
very similar and is always slightly lower than the variability of the weights
for the regression tree estimator.
Table 4.5
Average variance (CV) for weights across samples
| Estimator | Sample size 200 | Sample size 500 | Sample size 1,000 |
|---|---|---|---|
| GREG | 728.18 (0.59) | 77.14 (0.48) | 16.41 (0.44) |
| FSTEP | 462.81 (0.47) | 67.45 (0.45) | 15.90 (0.44) |
| TREE | 374.43 (0.42) | 59.35 (0.42) | 14.70 (0.42) |
| CLASSO (1-way) | 354.57 (0.41) | 56.21 (0.41) | 14.03 (0.41) |
| CLASSO (2-way) | 361.83 (0.42) | 56.60 (0.41) | 14.06 (0.41) |
| CALASSO | 354.29 (0.41) | 56.28 (0.41) | 14.03 (0.41) |
We also computed the proportion of simulated samples
where the regression weights contained negative values. As mentioned in Section 2.5,
by construction, the weights for the regression tree estimator are guaranteed
to be strictly positive. When the sample size was 200, the GREG estimator
calibrated to 20 marginal categories yielded negative weights for approximately
3% of the repeated samples. There were no negative weights when the sample size
was 500 or 1,000. For the GREG estimator calibrated to 28 marginal categories,
approximately 27% of the repeated samples of size 200 contained negative
weights and less than 0.5% of the repeated samples of size 500 contained
negative weights. The GREG weights are unstable when the sample size is small,
especially if the GREG estimator is calibrated to auxiliary variables with many
categories. Using forward stepwise variable selection with the GREG estimator
resulted in a substantial decrease in the number of simulated samples with
negative weights for small sample sizes. The FSTEP estimator applied to the 28
marginal categories yielded negative weights in approximately 0.5% of the
repeated samples of size 200. There were no negative weights observed for the
lasso calibration estimator with only main effects or adaptive lasso
calibration estimator. Using the lasso calibration estimator with 2-way
interactions resulted in negative weights in less than 0.05% of the simulated
samples.
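The proportion of simulated samples with at least one negative weight is a simple tally; a minimal sketch with a hypothetical function name and toy weights follows.

```python
import numpy as np

def prop_samples_with_negative_weights(weights_by_sample):
    """Share of simulated samples in which at least one weight is negative."""
    flags = [bool(np.any(np.asarray(w, dtype=float) < 0)) for w in weights_by_sample]
    return float(np.mean(flags))

# Toy input (hypothetical): four simulated samples, one containing a negative weight
prop_neg = prop_samples_with_negative_weights(
    [[1.0, 2.0], [-0.5, 3.0], [2.0, 2.0], [1.0, 1.0]]
)
print(f"proportion of samples with negative weights: {prop_neg:.2f}")
```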
4.5 Estimation based on a single set of weights
A major drawback in the implementation of the regression
tree and the calibrated lasso-based approaches is that the estimation
procedures yield variable-specific weights. We conducted additional simulations
in which a single set of variable-specific weights was applied to other related
survey variables of interest. In the context of our business survey data, we
considered four survey variables of interest: the amount of trade credit
requested and the amounts requested for three additional types of
financing (line of credit, business credit card and lease financing). We
examined the impact on bias and the loss of efficiency in using a single set of
weights, determined by a primary variable of interest, to estimate the total
amount requested for the remaining three survey variables of interest.
Specifically, we calculated the percentage absolute relative design bias for
the estimators of the total amount requested and for the variance estimators. We
also calculated the ratio of the MSE for the regression tree and three
calibrated lasso-based approaches using the set of weights corresponding to a
primary variable of interest to the MSE for the same estimators using
variable-specific weights. For brevity, we considered only settings with 28
marginal categories.
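Applying one set of weights to several survey variables amounts to a single weighted sum per variable. The sketch below uses a hypothetical function name and made-up numbers purely to show the mechanics; in the simulations, the weights would be those obtained for the primary variable of interest.

```python
import numpy as np

def totals_with_shared_weights(w, Y):
    """Estimate the total of each column of Y using one shared weight vector w."""
    w = np.asarray(w, dtype=float)
    Y = np.asarray(Y, dtype=float)   # rows: sampled units, columns: survey variables
    return w @ Y

# Toy input (hypothetical): three sampled units, two survey variables
w_primary = [10.0, 20.0, 30.0]       # weights derived for the primary variable
Y = [[1.0, 0.0],
     [2.0, 1.0],
     [0.0, 3.0]]
shared_totals = totals_with_shared_weights(w_primary, Y)
print(shared_totals)
```

Repeating this over the simulated samples and comparing the resulting MSE with that obtained from variable-specific weights gives the ratios reported in Table 4.6.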
The percentage absolute relative design bias was less
than 2 percent for all of the estimators for all scenarios. For all estimators
and primary variables of interest, the bias decreases as the sample size
increases.
Unlike the bias of the variance estimators based on
variable-specific weights, the bias of the variance estimators based on a
single set of weights for a primary variable of interest does not necessarily
decrease as the sample size increases. As well, the bias is not strictly in one
direction and may be positive or negative. For the regression tree and
calibrated lasso-based approaches, the bias of the variance estimators is
substantially larger for the primary variable of interest used to calculate the
single set of weights than for the other study variables. The data-driven nature
of these estimators means that the variance for the primary variable
of interest is underestimated, as shown in Table 4.4.
Table 4.6 displays the ratio of the design MSE of
each estimator with weights determined by a primary variable of interest to
that of the estimator with variable-specific weights, calculated separately for
each of the four study variables for sample sizes of 200 and 500. Using a single
set of weights determined by a primary variable of interest results in a similar
or slightly higher MSE than using variable-specific weights. Here, the loss in
efficiency is modest, less than 8% in all settings considered. Similar results
were obtained for a sample size of 1,000. There is no clear pattern in terms of
loss of efficiency and sample size.
Table 4.6
Ratio of MSE for each estimator with weights determined by primary variable of interest to MSE for estimator with variable-specific weights
| Primary variable | Estimator | Trade Credit (sample size 200) | Trade Credit (sample size 500) | Line of Credit (sample size 200) | Line of Credit (sample size 500) | Business Credit Card (sample size 200) | Business Credit Card (sample size 500) | Lease Financing (sample size 200) | Lease Financing (sample size 500) |
|---|---|---|---|---|---|---|---|---|---|
| Trade Credit | TREE | - | - | 1.01 | 0.97 | 0.99 | 1.00 | 0.99 | 1.00 |
| Trade Credit | CLASSO (1-way) | - | - | 0.99 | 0.99 | 1.01 | 0.99 | 1.00 | 1.01 |
| Trade Credit | CLASSO (2-way) | - | - | 0.93 | 0.94 | 0.92 | 0.98 | 0.92 | 0.97 |
| Trade Credit | CALASSO | - | - | 0.97 | 0.99 | 1.01 | 0.99 | 0.96 | 1.00 |
| Line of Credit | TREE | 1.06 | 0.97 | - | - | 0.98 | 1.00 | 0.98 | 0.97 |
| Line of Credit | CLASSO (1-way) | 0.96 | 0.98 | - | - | 0.99 | 1.01 | 0.99 | 0.99 |
| Line of Credit | CLASSO (2-way) | 0.95 | 0.96 | - | - | 0.92 | 0.98 | 0.93 | 0.96 |
| Line of Credit | CALASSO | 0.97 | 0.98 | - | - | 0.99 | 1.00 | 0.96 | 0.98 |
| Business Credit Card | TREE | 1.06 | 1.01 | 1.06 | 0.97 | - | - | 0.99 | 1.02 |
| Business Credit Card | CLASSO (1-way) | 0.99 | 1.02 | 0.98 | 0.97 | - | - | 0.99 | 1.02 |
| Business Credit Card | CLASSO (2-way) | 0.98 | 1.00 | 0.95 | 0.93 | - | - | 0.92 | 0.99 |
| Business Credit Card | CALASSO | 1.00 | 1.02 | 0.97 | 0.97 | - | - | 1.00 | 1.01 |
| Lease Financing | TREE | 1.07 | 1.03 | 1.06 | 1.05 | 0.99 | 1.02 | - | - |
| Lease Financing | CLASSO (1-way) | 0.99 | 1.05 | 0.98 | 1.04 | 0.99 | 1.02 | - | - |
| Lease Financing | CLASSO (2-way) | 0.97 | 1.02 | 0.96 | 1.01 | 0.92 | 0.99 | - | - |
| Lease Financing | CALASSO | 1.00 | 1.05 | 0.98 | 1.05 | 1.00 | 1.01 | - | - |
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa