Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 5. Estimation under non-probability sampling
In this section, we study the effect of selection bias
on the survey regression estimators under non-probability sampling. We
considered two types of selection bias that may be present in
non-probability samples: a scenario in which the
probability of selection depends only on the auxiliary data available for all
units in the population, and a scenario in which the probability of selection
depends on the survey variable of interest. In both scenarios, we evaluated the
absolute relative bias (ARB) of each estimator of the total. Following
Chen, Valliant and Elliott (2018), we treat the non-probability sample as a
simple random sample and set the design weights equal to $N/n$ when estimating the total, because the selection process for non-probability
samples is unknown in practice.
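The evaluation criterion can be sketched in a few lines of Python. This is a minimal illustration, not the simulation code used in the study: it assumes the usual definition of ARB (absolute difference between the Monte Carlo mean of the estimates and the true total, relative to the true total, expressed in percent), assumes equal weights $N/n$ for a sample treated as a simple random sample, and uses a hypothetical lognormal population. The function names `expansion_total` and `absolute_relative_bias` are introduced here for illustration only.

```python
import random


def expansion_total(sample_y, N):
    """Estimate a population total using equal weights N/n,
    i.e. treating the sample as a simple random sample."""
    n = len(sample_y)
    return (N / n) * sum(sample_y)


def absolute_relative_bias(estimates, true_total):
    """ARB in percent (assumed definition):
    100 * |mean of estimates - true total| / true total."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * abs(mean_est - true_total) / true_total


# Hypothetical population; not the data used in the study.
rng = random.Random(2024)
population = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
true_total = sum(population)

# Repeated genuine simple random samples: the expansion estimator is
# design-unbiased here, so the ARB should be close to zero.
estimates = [
    expansion_total(rng.sample(population, 200), len(population))
    for _ in range(500)
]
arb_srs = absolute_relative_bias(estimates, true_total)
print(f"ARB under genuine SRS: {arb_srs:.2f}%")
```

Under a genuine simple random sample the ARB is only Monte Carlo noise; the scenarios that follow examine what happens when the same equal weights are applied to samples that were not selected with equal probabilities.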
5.1 Selection probabilities depend on auxiliary data
We drew repeated samples using the same stratified SRS design as in Section 4. Table 5.1 displays the ARB of each estimator of the total amount of trade credit requested when the sample is treated as a simple random sample but is in fact selected using disproportionate stratified random sampling.
As expected, the wholly design-based HT estimator has the largest bias, and this bias does not decrease as the sample size increases. The ARB of the model-assisted estimators decreases as the sample size increases. The GREG estimator has the smallest bias, particularly for small sample sizes. Furthermore, the GREG estimator is approximately unbiased if revenue is included as one of the auxiliary variables for calibration. However, if stepwise variable selection is used, the GREG estimator is no longer unbiased for small sample sizes. On the other hand, if revenue is not included as a calibration variable, the GREG estimator is slightly biased.
The lasso-based and, to a lesser extent, the regression tree estimators suffer from small-sample bias even when revenue is correctly included as an auxiliary variable. This is most apparent for the standard lasso estimators, which do not include calibration to known population totals. For $n$ equal to 500 or 1,000, including revenue as an auxiliary variable substantially decreases the bias of the regression tree and calibrated lasso estimators but only slightly decreases the bias of the lasso estimators without calibration. This indicates that the additional calibration step is important for diminishing the effect of selection bias, especially if the sample size is small.
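
Table 5.1: ARB of each estimator of the total amount of trade credit requested, with and without revenue as an auxiliary variable, by sample size.

| Estimator | With revenue, n = 200 | With revenue, n = 500 | With revenue, n = 1,000 | Without revenue, n = 200 | Without revenue, n = 500 | Without revenue, n = 1,000 |
|---|---|---|---|---|---|---|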
| GREG | 0.31 | 0.06 | 0.06 | 4.84 | 5.12 | 4.71 |
| FSTEP | 2.67 | 0.44 | 0.06 | 9.20 | 5.18 | 4.92 |
| TREE | 4.15 | 1.04 | 0.50 | 17.40 | 10.20 | 8.94 |
| LASSO (1-way) | 17.42 | 5.10 | 2.32 | 16.32 | 8.88 | 6.49 |
| CLASSO (1-way) | 7.99 | 0.83 | 0.20 | 9.04 | 5.22 | 4.59 |
| LASSO (2-way) | 25.36 | 14.28 | 8.40 | 26.31 | 15.16 | 9.89 |
| CLASSO (2-way) | 10.72 | 1.44 | 1.02 | 14.19 | 5.56 | 3.84 |
| ALASSO | 14.95 | 5.63 | 3.00 | 14.35 | 8.64 | 6.51 |
| CALASSO | 9.63 | 2.54 | 1.25 | 9.27 | 5.77 | 4.92 |
| HT | 49.45 | 48.84 | 48.81 | 49.08 | 49.29 | 48.60 |
These results indicate that when the selection probability depends on a known auxiliary variable, including that variable in the working model for the GREG estimator effectively diminishes the effect of selection bias. This was not the case for the model-assisted estimators that involve variable selection: performing variable selection may increase bias because auxiliary variables that are predictive of the selection probability may not be selected and, therefore, may not be properly accounted for. The lasso estimators can be constructed such that user-specified variables are always included in the working regression model. These user-specified variables can also be added to the calibration variables in equation (2.5) to force calibration to the corresponding population totals. Unfortunately, the underlying selection mechanism is unknown in practice and, therefore, correctly identifying the variables that affect the selection probability is challenging.
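The mechanism described in this subsection can be reproduced in a toy setting. The sketch below is an assumption-laden illustration, not the study's simulation: the population, the linear relationship between the survey variable and a single auxiliary variable, the two strata and the 40/160 allocation are all hypothetical. It compares the equal-weight expansion estimator with a single-covariate regression (GREG-type) estimator that uses the known population total of the auxiliary variable driving selection.

```python
import random


def naive_total(y_s, N):
    """Expansion estimator with equal weights N/n (sample treated as SRS)."""
    return N * sum(y_s) / len(y_s)


def greg_total(x_s, y_s, N, x_total):
    """Regression estimator with an intercept and one covariate, equal
    weights: N*ybar + b*(T_x - N*xbar), with b the OLS slope in the sample."""
    n = len(y_s)
    xbar = sum(x_s) / n
    ybar = sum(y_s) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x_s, y_s))
    sxx = sum((a - xbar) ** 2 for a in x_s)
    return N * ybar + (sxy / sxx) * (x_total - N * xbar)


def arb_percent(estimates, true_total):
    """Assumed ARB definition, in percent."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * abs(mean_est - true_total) / true_total


# Hypothetical population: x plays the role of an auxiliary variable such
# as revenue; y is linear in x plus noise.
rng = random.Random(7)
N = 4000
xs = sorted(rng.lognormvariate(0.0, 0.5) for _ in range(N))
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]
units = list(zip(xs, ys))
low, high = units[: N // 2], units[N // 2:]  # two strata defined by x
x_total, y_total = sum(xs), sum(ys)

naive_est, greg_est = [], []
for _ in range(300):
    # Disproportionate allocation: large-x units are over-sampled.
    sample = rng.sample(low, 40) + rng.sample(high, 160)
    x_s = [u[0] for u in sample]
    y_s = [u[1] for u in sample]
    naive_est.append(naive_total(y_s, N))
    greg_est.append(greg_total(x_s, y_s, N, x_total))

arb_naive = arb_percent(naive_est, y_total)
arb_greg = arb_percent(greg_est, y_total)
print(f"expansion ARB: {arb_naive:.2f}%   regression ARB: {arb_greg:.2f}%")
```

In this toy setup the regression estimator removes most of the bias because selection depends only on the covariate in the working model and the linear model holds. Dropping that covariate from the model, as a variable selection step might, would leave the selection effect uncorrected.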
5.2 Selection probabilities depend on the study variable
Next, we drew repeated samples using Poisson sampling where the sampling probabilities depend on the survey variable of interest. We assume the Poisson sampling probabilities are given by:
where the survey variable is the amount of trade credit requested, in millions of dollars. The intercept values were chosen such that we obtained sample sizes of approximately 200, 500 and 1,000 units, averaged over the simulated samples. Under this sampling design, units with larger amounts of trade credit requested have a higher probability of being sampled and, therefore, are over-represented.
Table 5.2 displays the ARB of each estimator of the total amount of trade credit requested when the sample is treated as a simple random sample but is in fact selected using the above informative Poisson sampling. Here, all the estimators are heavily biased because, under informative sampling, the population model does not hold for the sampled units. The magnitude of the bias is very similar across estimators and does not substantially decrease as the sample size increases. The inclusion or exclusion of revenue as an auxiliary variable has little impact on the bias.
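
Table 5.2: ARB of each estimator of the total amount of trade credit requested under informative Poisson sampling, with and without revenue as an auxiliary variable, by intercept value.

| Estimator | With revenue, intercept -3.8 | With revenue, intercept -2.85 | With revenue, intercept -2.1 | Without revenue, intercept -3.8 | Without revenue, intercept -2.85 | Without revenue, intercept -2.1 |
|---|---|---|---|---|---|---|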
| GREG | 23.53 | 22.27 | 20.45 | 24.74 | 22.91 | 21.21 |
| FSTEP | 24.54 | 22.55 | 20.58 | 25.16 | 23.24 | 21.15 |
| TREE | 24.07 | 22.73 | 20.15 | 24.93 | 22.47 | 20.55 |
| LASSO (1-way) | 24.29 | 22.73 | 20.65 | 25.45 | 23.29 | 21.38 |
| CLASSO (1-way) | 23.02 | 22.30 | 20.47 | 24.74 | 22.99 | 21.23 |
| LASSO (2-way) | 23.15 | 22.06 | 20.17 | 24.66 | 22.73 | 20.62 |
| CLASSO (2-way) | 20.11 | 20.18 | 19.01 | 22.62 | 21.63 | 19.98 |
| ALASSO | 24.44 | 22.72 | 20.66 | 25.50 | 23.21 | 21.36 |
| CALASSO | 23.91 | 22.46 | 20.53 | 25.10 | 23.01 | 21.25 |
| HT | 29.12 | 27.95 | 25.57 | 29.36 | 27.53 | 25.45 |
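
Informative Poisson sampling of this kind can be illustrated with a small simulation. The sketch below rests on several assumptions that are not taken from the study: the inclusion probability is assumed to be a logistic function of the survey variable, and the intercept (-4.0), the slope (0.3) and the population are hypothetical values chosen for illustration. As in the previous scenario, the sample is treated as a simple random sample with equal weights.

```python
import math
import random


def inclusion_prob(y, intercept, slope):
    """Assumed logistic inclusion probability (hypothetical form)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * y)))


def naive_total(y_s, N):
    """Expansion estimator with equal weights N/n."""
    return N * sum(y_s) / len(y_s)


def greg_total(x_s, y_s, N, x_total):
    """Regression estimator with an intercept and one covariate."""
    n = len(y_s)
    xbar = sum(x_s) / n
    ybar = sum(y_s) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x_s, y_s))
    sxx = sum((a - xbar) ** 2 for a in x_s)
    return N * ybar + (sxy / sxx) * (x_total - N * xbar)


def arb_percent(estimates, true_total):
    """Assumed ARB definition, in percent."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * abs(mean_est - true_total) / true_total


# Hypothetical population: y depends on x plus skewed positive noise.
rng = random.Random(11)
N = 4000
xs = [rng.lognormvariate(0.0, 0.5) for _ in range(N)]
ys = [3.0 * x + rng.expovariate(0.5) for x in xs]
x_total, y_total = sum(xs), sum(ys)
probs = [inclusion_prob(y, -4.0, 0.3) for y in ys]

naive_est, greg_est = [], []
for _ in range(200):
    # Poisson sampling: each unit is included independently with its own
    # probability, which here increases with y.
    picked = [i for i in range(N) if rng.random() < probs[i]]
    x_s = [xs[i] for i in picked]
    y_s = [ys[i] for i in picked]
    naive_est.append(naive_total(y_s, N))
    greg_est.append(greg_total(x_s, y_s, N, x_total))

arb_naive = arb_percent(naive_est, y_total)
arb_greg = arb_percent(greg_est, y_total)
print(f"expansion ARB: {arb_naive:.2f}%   regression ARB: {arb_greg:.2f}%")
```

Because selection depends on the survey variable itself, the residuals of the sampled units are no longer centred at zero, so adjusting for the covariate cannot remove the bias in this toy setting. Both estimators remain clearly biased, which is consistent with the pattern in Table 5.2.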