Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 4. Results of the simulation study

4.1   Performance of estimators in terms of design MSE

We computed design bias and design mean square error (MSE) from the 5,000 total estimates by sample size and number of marginal categories. The percentage absolute relative design bias was less than 2 percent for all the estimators for all scenarios. As expected, for all estimators, the bias decreases as the sample size increases.

Figure 4.1 displays the MSE of the HT, GREG, GREG with forward variable selection, regression tree and calibrated lasso estimators by sample size, based on the 5,000 simulated samples. The MSE values are similar for the adaptive and non-calibrated versions of the lasso estimators. For all the estimators, the decrease in MSE is much more pronounced from $n = 200$ to $n = 500$ than from $n = 500$ to $n = 1{,}000$. This is likely due to the small sample size, relative to the number of categories for the auxiliary variables. It may not be possible to explore all the potential effects, particularly higher order effects, with only 200 sampled units.

Table 4.1 displays the ratio of the design MSE of each estimator to the MSE of the HT estimator for the total amount of trade credit requested. For $n = 200$, the regression tree estimator and the lasso (2-way) estimator with two-factor interaction effects are the only model-assisted estimators that provide any efficiency gains, relative to the HT estimator, when the number of categories of auxiliary variables used is large. As the sample size increases, the gains in efficiency of the model-assisted survey regression estimators, relative to the HT estimator, are essentially equal. Using any of the model-assisted estimators when $n = 1{,}000$ results in a slight gain in efficiency, relative to the HT estimator. There is little efficiency advantage for model-assisted estimators over the HT estimator, indicating that the auxiliary variables are not strongly related to the variable of interest.
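
As an illustration of the quantities discussed above, the Monte Carlo design bias, percentage absolute relative bias, MSE and MSE ratio to the HT estimator could be computed along the following lines. This is a minimal sketch rather than the code used in the study: the function names and the toy estimates are hypothetical, and the true population total is assumed to be known, as it is in a simulation.

```python
from statistics import fmean


def design_bias_and_mse(estimates, true_total):
    """Monte Carlo design bias, percentage absolute relative bias and MSE.

    `estimates` holds one estimate of the total per simulated sample.
    """
    bias = fmean(estimates) - true_total
    pct_abs_rel_bias = 100.0 * abs(bias) / abs(true_total)
    mse = fmean([(est - true_total) ** 2 for est in estimates])
    return bias, pct_abs_rel_bias, mse


def mse_ratio_to_ht(estimates, ht_estimates, true_total):
    """Ratio of an estimator's MSE to the HT estimator's MSE.

    Values below 1 indicate an efficiency gain over the HT estimator.
    """
    _, _, mse = design_bias_and_mse(estimates, true_total)
    _, _, mse_ht = design_bias_and_mse(ht_estimates, true_total)
    return mse / mse_ht


# Toy numbers (hypothetical): four "simulated samples" instead of 5,000.
true_total = 100.0
ht = [96.0, 104.0, 100.0, 108.0]
greg = [98.0, 102.0, 100.0, 104.0]
print(design_bias_and_mse(greg, true_total))
print(mse_ratio_to_ht(greg, ht, true_total))
```

In a full simulation, the two lists would each contain 5,000 estimates, one per repeated sample, for a given sample size and number of marginal categories.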

Figure 4.1

Description of Figure 4.1

Figure comparing the mean square error (MSE) of the HT (red), GREG (orange), FSTEP (green), regression tree (blue) and calibrated lasso estimators (1-way in purple and 2-way in turquoise) for the total amount of trade credit requested, by sample size $n$ (200, 500 and 1,000). For all the estimators, the decrease in MSE is much more pronounced from $n = 200$ to $n = 500$ than from $n = 500$ to $n = 1{,}000$. For $n = 200$, the regression tree estimator is the only estimator that provides efficiency gains, relative to the HT estimator. For $n = 500$ and $n = 1{,}000$, all the estimators provide efficiency gains, relative to the HT estimator.


Table 4.1
Ratio of MSE of each estimator to MSE of HT estimator with 20 and 28 marginal categories

| Estimator | 20 categories, $n = 200$ | 20 categories, $n = 500$ | 20 categories, $n = 1{,}000$ | 28 categories, $n = 200$ | 28 categories, $n = 500$ | 28 categories, $n = 1{,}000$ |
|---|---|---|---|---|---|---|
| GREG | 1.067 | 1.011 | 0.994 | 1.084 | 0.959 | 0.954 |
| FSTEP | 1.036 | 1.009 | 0.994 | 1.040 | 0.945 | 0.958 |
| TREE | 1.023 | 1.007 | 0.977 | 0.983 | 0.963 | 0.949 |
| LASSO (1-way) | 1.020 | 0.995 | 0.986 | 1.009 | 0.946 | 0.947 |
| CLASSO (1-way) | 1.047 | 1.004 | 0.990 | 1.042 | 0.952 | 0.949 |
| LASSO (2-way) | 0.999 | 0.995 | 0.952 | 0.981 | 0.935 | 0.936 |
| CLASSO (2-way) | 1.061 | 1.029 | 0.966 | 1.045 | 0.959 | 0.950 |
| ALASSO | 1.024 | 0.999 | 0.986 | 1.021 | 0.948 | 0.948 |
| CALASSO | 1.040 | 1.005 | 0.989 | 1.037 | 0.951 | 0.949 |

The potential gains in efficiency for model-assisted estimators depend on the predictive power of the working model. In our simulation population, the strength of the relationship between the variable of interest and the available auxiliary variables is weak, leading to only slight efficiency gains relative to the purely design-based HT estimator. Therefore, to further explore the differences between the various model-assisted survey estimators, we ran additional simulations using different survey variables of interest, generated according to the following procedure:

  1. Assuming a lasso model with main effects only, we obtained the lasso coefficient estimates for the amount of trade credit requested, $y_i$, using the population values for the auxiliary variables $\mathbf{x}_i$, including revenue.
  2. We used the coefficient estimates $\hat{\boldsymbol{\beta}}_L$ obtained in step 1 and the population values for $\mathbf{x}_i$ to generate a new survey variable of interest

     $$y_i^* = \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_L + u_i,$$

     where $u_i$ is a normally distributed random variable with mean 0 and standard deviation $\sigma$ chosen such that the adjusted coefficient of determination is approximately $R^2 = 0.5$.
  3. We drew 5,000 repeated samples from the target population and calculated the mean square error of each estimator of the total $t_{y^*}$.
  4. Steps 1-3 were repeated by fitting a lasso regression model with main effects and 2-way interactions and a regression tree model using the algorithm detailed in Section 2.5.
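
The generation of the new survey variable can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually coded for the study: the text does not say how $\sigma$ was chosen, so the sketch sets $\sigma^2 = \mathrm{Var}(\mathbf{x}^T \hat{\boldsymbol{\beta}}_L)(1 - R^2)/R^2$, which targets the unadjusted population $R^2$ (very close to the adjusted value when the population is large relative to the number of coefficients). The function name, the toy population and the coefficient values are hypothetical.

```python
import random
from statistics import pvariance


def generate_survey_variable(x_rows, beta, target_r2=0.5, seed=2024):
    """Generate y*_i = x_i' beta + u_i with u_i ~ N(0, sigma^2).

    Assumption: sigma is set from the population variance of the linear
    predictor so that Var(x'beta) / (Var(x'beta) + sigma^2) = target_r2.
    """
    rng = random.Random(seed)
    linear_part = [sum(b * x for b, x in zip(beta, row)) for row in x_rows]
    sigma = (pvariance(linear_part) * (1.0 - target_r2) / target_r2) ** 0.5
    y_star = [lp + rng.gauss(0.0, sigma) for lp in linear_part]
    return y_star, linear_part, sigma


# Toy population (hypothetical): intercept plus two dummy-coded categories.
pop_rng = random.Random(1)
x_rows = [[1.0, float(pop_rng.random() < 0.3), float(pop_rng.random() < 0.5)]
          for _ in range(20000)]
beta_hat = [10.0, 5.0, -3.0]  # stands in for the lasso estimates from step 1

y_star, linear_part, sigma = generate_survey_variable(x_rows, beta_hat)

# Realized share of the variance of y* explained by the linear predictor.
residuals = [y - lp for y, lp in zip(y_star, linear_part)]
realized_r2 = 1.0 - pvariance(residuals) / pvariance(y_star)
print(round(sigma, 3), round(realized_r2, 3))
```

The realized $R^2$ differs slightly from the target because the simulated errors are a finite draw; with a population of this size the difference is small.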

Table 4.2 displays the ratio of the design MSE of each estimator to that of the HT estimator under the three different models generating the survey variable of interest, for a sample size of $n = 1{,}000$. As expected, the estimator based on the correctly specified working model is the most efficient. In the case where the true generating model contains only main effects, assuming a working model with higher order interactions results in a slight loss in efficiency. If two-way or higher order interactions are present, the regression tree and lasso-based estimators fitted with two-way interactions are more efficient than the model-assisted estimators based on working models with only main effects. When the generating model is a regression tree, the regression tree estimator yields modest efficiency gains over the 2-way lasso-based estimators. This can be explained by the fact that the regression tree model groups the categories of an auxiliary variable based on their relationship to the variable of interest and, therefore, reduces the model size. In all cases, significant efficiency gains, relative to the design-based HT estimator, are achieved.


Table 4.2
Ratio of MSE for each estimator to MSE of HT under different models generating survey variable of interest
Columns give the model used to generate the survey variable of interest.

| Estimator | LASSO (1-way) | LASSO (2-way) | Regression Tree |
|---|---|---|---|
| GREG | 0.749 | 0.855 | 0.878 |
| FSTEP | 0.749 | 0.855 | 0.876 |
| TREE | 0.803 | 0.821 | 0.778 |
| LASSO (1-way) | 0.747 | 0.850 | 0.871 |
| CLASSO (1-way) | 0.747 | 0.851 | 0.873 |
| LASSO (2-way) | 0.763 | 0.761 | 0.826 |
| CLASSO (2-way) | 0.763 | 0.765 | 0.833 |
| ALASSO | 0.750 | 0.849 | 0.872 |
| CALASSO | 0.750 | 0.851 | 0.873 |

4.2   Performance under other scenarios

We also examined the performance of the lasso-based and regression tree estimators under scenarios where there are no main effects, only 2-way interactions. We generated a fourth survey variable of interest using the lasso regression model with main effects and 2-way interactions as described in the procedure above. However, in step 2, we set all coefficient estimates corresponding to main effects equal to 0.

The first column of Table 4.3 (labelled No Multicollinearity) shows the ratio of the design MSE of the estimators to that of the HT estimator, where the survey variable is generated from a model with no main effects, for a sample size of $n = 1{,}000$. Under this scenario, the lasso estimators with 2-way interactions and the regression tree estimator are significantly more efficient than model-assisted estimators based on main-effects-only models. Relative to the commonly used GREG estimator, the efficiency gains for the lasso estimators with 2-way interactions and the regression tree estimator are significantly greater when there are no main effects. This is evident from comparing the LASSO (2-way) column in Table 4.2 to the first column in Table 4.3. The relative MSE is very similar for the 2-way lasso and regression tree estimators but closer to 1 for the GREG and 1-way lasso estimators.


Table 4.3
Ratio of MSE for each estimator to MSE of HT under generating model with no main effects and in the absence/presence of multicollinearity

| Estimator | No Multicollinearity | Duplicated Variable | Collapsed Categories |
|---|---|---|---|
| GREG | 0.935 | - | - |
| TREE | 0.824 | 0.850 | 0.842 |
| LASSO (1-way) | 0.930 | 0.945 | 0.942 |
| CLASSO (1-way) | 0.936 | 0.953 | 0.951 |
| LASSO (2-way) | 0.783 | 0.795 | 0.773 |
| CLASSO (2-way) | 0.795 | 0.809 | 0.781 |

For administrative data with many variables, it is not uncommon for some variables to be collinear or nearly collinear. For example, information on both the total number of employees and the number of full-time equivalent employees is often available. The GREG estimator, and by extension the FSTEP estimator and adaptive lasso estimators, fail in the presence of collinearity as the design matrix is singular. We investigated the performance of the regression tree and lasso estimators in the presence of multicollinearity. We considered two types of multicollinearity, labelled Duplicated Variable and Collapsed Categories in Table 4.3.

The MSE, relative to the HT estimator, for $n = 1{,}000$ is shown in columns 2 and 3 of Table 4.3. These results are very similar to those in the first column of Table 4.3, where no multicollinearity is present. The regression tree and lasso estimators provide an automatic way of removing collinear auxiliary variables without impacting the potential efficiency gains. It should be noted that other methods, such as principal component analysis, can be used to eliminate collinearity but require some expertise.
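
To illustrate why the GREG estimator fails under collinearity, the sketch below checks the column rank of a small design matrix before and after adding a redundant column. How the two multicollinearity scenarios were actually constructed is not described in this section, so the toy matrices are hypothetical readings: a column copied from an existing one, and a column equal to the sum of two existing category indicators. The function name is also an assumption.

```python
import numpy as np


def has_full_column_rank(design_matrix):
    """True when no column is a linear combination of the others,
    so that X'X is invertible and least-squares coefficients are unique."""
    x = np.asarray(design_matrix, dtype=float)
    return bool(np.linalg.matrix_rank(x) == x.shape[1])


# Toy design: intercept plus two indicators of a three-category variable
# (the third category is the reference level).
base = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
])

# Hypothetical "duplicated variable": an exact copy of an existing column.
duplicated = np.column_stack([base, base[:, 1]])

# Hypothetical "collapsed categories": an indicator for the union of two
# categories, equal to the sum of their individual indicators.
collapsed = np.column_stack([base, base[:, 1] + base[:, 2]])

for name, x in [("base", base), ("duplicated", duplicated), ("collapsed", collapsed)]:
    print(name, has_full_column_rank(x))
```

Penalized or tree-based fits do not require inverting $X^T X$, which is consistent with the lasso and regression tree estimators remaining available in columns 2 and 3 of Table 4.3 while the GREG estimator is not.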

4.3   Performance of variance estimators in terms of relative bias

Variance estimators based on (2.8) were constructed for each estimator. Table 4.4 displays the percentage relative bias of each variance estimator for the total amount of trade credit requested. For comparison purposes, the theoretically unbiased variance estimator of the HT estimator is included in this table. This variance estimator is equivalent to the expression provided in (2.8) where $e_i = y_i - \bar{y}_s$. The variance estimators for the model-assisted survey regression estimators have substantial negative bias, which increases as the number of auxiliary variables, $p$, increases. The magnitude of the negative bias is largest for the lasso-based estimators fitted using 2-way interactions. For small sample sizes, the negative bias is smallest for the regression tree estimator. As well, for small sample sizes, there is a substantial difference in bias between the GREG and FSTEP estimators. Performing variable selection prior to calculating the standard GREG calibration estimator appears to reduce the bias of the variance estimator in this case. The bias is reduced for all model-assisted survey regression estimators as the sample size increases.
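
The percentage relative bias of a variance estimator can be computed from simulation output along the following lines. This is a sketch under an assumption: the section does not state the reference value, so the Monte Carlo MSE of the point estimator is used here as the benchmark (the Monte Carlo variance would be a close alternative given the small design bias). The function name and toy numbers are hypothetical.

```python
from statistics import fmean


def percent_relative_bias(variance_estimates, point_estimates, true_total):
    """100 * (mean variance estimate - Monte Carlo MSE) / Monte Carlo MSE.

    Negative values mean the variance estimator understates the
    variability of the point estimator on average.
    """
    mc_mse = fmean([(est - true_total) ** 2 for est in point_estimates])
    return 100.0 * (fmean(variance_estimates) - mc_mse) / mc_mse


# Toy numbers (hypothetical): four simulated samples instead of 5,000.
point_estimates = [98.0, 102.0, 100.0, 104.0]
variance_estimates = [5.0, 5.5, 5.3, 5.8]
print(round(percent_relative_bias(variance_estimates, point_estimates, 100.0), 2))
```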


Table 4.4
Percent relative bias of variance estimators

| Estimator | 20 categories, $n = 200$ | 20 categories, $n = 500$ | 20 categories, $n = 1{,}000$ | 28 categories, $n = 200$ | 28 categories, $n = 500$ | 28 categories, $n = 1{,}000$ |
|---|---|---|---|---|---|---|
| GREG | -12.44 | -4.16 | -1.60 | -22.23 | -10.86 | -6.99 |
| FSTEP | -7.05 | -3.60 | -1.62 | -14.07 | -7.71 | -6.73 |
| TREE | -5.79 | -5.53 | -2.81 | -8.45 | -12.93 | -10.83 |
| LASSO (1-way) | -7.79 | -2.96 | -1.14 | -12.42 | -9.49 | -6.44 |
| CLASSO (1-way) | -10.08 | -3.74 | -1.61 | -16.01 | -9.84 | -6.52 |
| LASSO (2-way) | -11.94 | -11.57 | -7.62 | -16.12 | -15.14 | -13.08 |
| CLASSO (2-way) | -19.99 | -15.09 | -9.06 | -25.87 | -19.04 | -15.14 |
| ALASSO | -8.69 | -3.61 | -1.41 | -14.52 | -9.43 | -6.38 |
| CALASSO | -9.40 | -3.78 | -1.48 | -15.80 | -9.64 | -6.46 |
| HT | 5.19 | 5.72 | 5.82 | 4.90 | -0.11 | 1.66 |

Given the bias of the variance estimators seen here, particularly for small sample sizes, a possible concern is the quality of the first-order Taylor expansion approximation. For a large number of categorical auxiliary variables, the remainder term in the Taylor expansion may no longer be negligible for small sample sizes. An alternative variance estimator for the lasso estimators was considered by McConville et al. (2017) but yielded only slight improvements in terms of bias reduction. An additional concern is properly accounting for the inherently data-driven procedure used to estimate the regression tree and lasso models: the splits of the regression tree and the penalty parameter of the lasso models both depend on the sample.

4.4   Properties of the survey weights

Regression weights are directly available for the GREG, FSTEP, regression tree, lasso calibration (1-way and 2-way) and adaptive lasso calibration estimators. We investigated the properties of the weights for these estimators in our simulations.

Large variation in the values of the weights is undesirable as it allows some units to be much more influential than others. Positive weights are preferred by national statistical organizations, as a negative weight can no longer be interpreted as the number of population units represented by the sampled unit.

First, we computed the average, over repeated samples, of the empirical within-sample variance of the weights:

$$\overline{\mathrm{var}}(\mathbf{w}) = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{n-1} \sum_{j \in s^{(r)}} \left( w_j^{(r)} - \bar{w}^{(r)} \right)^2,$$

where $s^{(r)}$ is the $r$th simulated sample, $\bar{w}^{(r)} = \frac{1}{n} \sum_{j \in s^{(r)}} w_j^{(r)}$ and $w_j^{(r)}$ is the weight of the $j$th unit in the $r$th simulated sample. We also computed the average coefficient of variation (CV) of the weights:

$$\overline{\mathrm{CV}}(\mathbf{w}) = \frac{1}{R} \sum_{r=1}^{R} \frac{\sqrt{\frac{1}{n-1} \sum_{j \in s^{(r)}} \left( w_j^{(r)} - \bar{w}^{(r)} \right)^2}}{\bar{w}^{(r)}}.$$
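
The two averages defined above translate directly into code. The following minimal sketch assumes the weights from each simulated sample are stored as one list per sample; the function name and the toy weights are hypothetical.

```python
from statistics import fmean, variance


def average_weight_variance_and_cv(weights_by_sample):
    """Average, over simulated samples, of the within-sample variance
    (divisor n - 1) and of the coefficient of variation of the weights."""
    variances = [variance(w) for w in weights_by_sample]
    cvs = [variance(w) ** 0.5 / fmean(w) for w in weights_by_sample]
    return fmean(variances), fmean(cvs)


# Toy weights (hypothetical): R = 2 simulated samples of n = 3 units each.
weights_by_sample = [
    [1.0, 2.0, 3.0],  # variance 1, mean 2, CV 0.5
    [2.0, 2.0, 2.0],  # variance 0, mean 2, CV 0
]
avg_var, avg_cv = average_weight_variance_and_cv(weights_by_sample)
print(avg_var, avg_cv)
```

Because the CV divides each sample's standard deviation by that sample's mean weight, it is comparable across sample sizes, whereas the average variance shrinks as the weights themselves get smaller with larger $n$.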

Table 4.5 displays the average variance and average CV for the weights across samples when revenue was included as an auxiliary variable. The weights for the GREG estimator and, to a lesser extent, the FSTEP estimator are much more variable than the weights for the regression tree and lasso-based estimators, particularly for small sample sizes. The variability of the weights for the three lasso-based approaches is very similar and is always slightly lower than the variability of the weights for the regression tree estimator.


Table 4.5
Average variance (CV) for weights across samples

| Estimator | n = 200 | n = 500 | n = 1,000 |
|---|---|---|---|
| GREG | 728.18 (0.59) | 77.14 (0.48) | 16.41 (0.44) |
| FSTEP | 462.81 (0.47) | 67.45 (0.45) | 15.90 (0.44) |
| TREE | 374.43 (0.42) | 59.35 (0.42) | 14.70 (0.42) |
| CLASSO (1-way) | 354.57 (0.41) | 56.21 (0.41) | 14.03 (0.41) |
| CLASSO (2-way) | 361.83 (0.42) | 56.60 (0.41) | 14.06 (0.41) |
| CALASSO | 354.29 (0.41) | 56.28 (0.41) | 14.03 (0.41) |

We also computed the proportion of simulated samples in which the regression weights contained negative values. As mentioned in Section 2.5, by construction, the weights for the regression tree estimator are guaranteed to be strictly positive. When the sample size was 200, the GREG estimator calibrated to 20 marginal categories yielded negative weights for approximately 3% of the repeated samples. In that setting, there were no negative weights when the sample size was 500 or 1,000. For the GREG estimator calibrated to 28 marginal categories, approximately 27% of the repeated samples of size 200 and less than 0.5% of the repeated samples of size 500 contained negative weights. The GREG weights are unstable when the sample size is small, especially if the GREG estimator is calibrated to auxiliary variables with many categories. Using forward stepwise variable selection with the GREG estimator substantially reduced the number of simulated samples with negative weights for small sample sizes: the FSTEP estimator applied to the 28 marginal categories yielded negative weights in approximately 0.5% of the repeated samples of size 200. No negative weights were observed for the lasso calibration estimator with only main effects or for the adaptive lasso calibration estimator. Using the lasso calibration estimator with 2-way interactions resulted in negative weights in less than 0.05% of the simulated samples.
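To make the mechanism concrete, the sketch below computes linear (GREG-type) calibration weights in their general textbook form and then the proportion of samples containing at least one negative weight. It is an illustration under stated assumptions, not the study's implementation: the function names, the toy design weights and the toy control totals are all hypothetical, and a constant variance factor is assumed in the working model.

```python
import numpy as np

def greg_weights(d, X, tx):
    """Linear calibration weights w_j = d_j * (1 + x_j' lam),
    where lam solves (X' diag(d) X) lam = tx - X'd.

    d  : design weights, shape (n,)
    X  : auxiliary variables for sampled units, shape (n, p)
    tx : known population totals of the auxiliaries, shape (p,)
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    M = X.T @ (d[:, None] * X)                       # X' diag(d) X
    lam = np.linalg.solve(M, np.asarray(tx, dtype=float) - X.T @ d)
    return d * (1.0 + X @ lam)

def proportion_with_negative_weights(weight_sets):
    """Share of samples whose weight vector contains at least one negative value."""
    flags = [bool(np.any(np.asarray(w) < 0)) for w in weight_sets]
    return sum(flags) / len(flags)

# Hypothetical toy sample: intercept plus one auxiliary variable
d = np.full(4, 10.0)
X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
tx = np.array([40.0, 160.0])   # totals far from the design-weighted estimates

w = greg_weights(d, X, tx)
print(w)                                            # calibrated weights
print(proportion_with_negative_weights([w, d]))     # one of the two weight sets has a negative value
```

In this toy case the control total for the auxiliary variable is far from its design-weighted estimate, so the linear adjustment pushes one weight below zero while still reproducing both totals exactly. This is the kind of instability that becomes more likely as the number of calibration constraints grows relative to the sample size.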

4.5   Estimation based on a single set of weights

A major drawback in the implementation of the regression tree and the calibrated lasso-based approaches is that the estimation procedures yield variable-specific weights. We conducted additional simulations in which a single set of weights, determined by one variable of interest, was applied to other related survey variables of interest. In the context of our business survey data, we considered four survey variables of interest: the amount of trade credit requested and the amounts requested for three additional types of financing (line of credit, business credit card and lease financing). We examined the bias and the loss of efficiency incurred by using a single set of weights, determined by a primary variable of interest, to estimate the total amount requested for the remaining three survey variables of interest. Specifically, we calculated the percentage absolute relative design bias for the estimators of the total amount requested and for the variance estimators. For the regression tree and the three calibrated lasso-based approaches, we also calculated the ratio of the MSE of the estimator using the set of weights corresponding to a primary variable of interest to the MSE of the estimator using variable-specific weights. For brevity, we considered only settings with 28 marginal categories.
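The two summary measures described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the percentage absolute relative bias is assumed to follow the usual definition (absolute difference between the Monte Carlo mean of the estimates and the true total, relative to the true total), and the toy estimates are invented for the example.

```python
import numpy as np

def pct_abs_relative_bias(estimates, true_total):
    """100 * |mean of estimates - true total| / |true total| (assumed definition)."""
    estimates = np.asarray(estimates, dtype=float)
    return 100.0 * abs(estimates.mean() - true_total) / abs(true_total)

def design_mse(estimates, true_total):
    """Monte Carlo approximation of the design MSE over the simulated samples."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_total) ** 2))

def mse_ratio(est_single, est_specific, true_total):
    """MSE with a single set of weights divided by MSE with variable-specific weights."""
    return design_mse(est_single, true_total) / design_mse(est_specific, true_total)

# Hypothetical toy estimates from two simulated samples, true total = 100
est_specific = [98.0, 102.0]   # variable-specific weights
est_single = [97.0, 103.0]     # single set of weights from a primary variable

print(pct_abs_relative_bias(est_single, 100.0))
print(mse_ratio(est_single, est_specific, 100.0))
```

A ratio above 1 indicates a loss of efficiency from using the single set of weights, and a ratio below 1 indicates that the single set of weights gave a smaller MSE in the simulation.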

The percentage absolute relative design bias was less than 2 percent for all of the estimators in all scenarios. For all estimators and all primary variables of interest, the bias decreases as the sample size increases.

Unlike the bias of the variance estimators based on variable-specific weights, the bias of the variance estimators based on a single set of weights for a primary variable of interest does not necessarily decrease as the sample size increases. In addition, the bias is not strictly in one direction and may be positive or negative. For the regression tree and calibrated lasso-based approaches, the bias of the variance estimators is substantially larger for the primary variable of interest used to calculate the single set of weights than for the other study variables. Because of the data-driven nature of these estimators, the variance for the primary variable of interest is underestimated, as shown in Table 4.4.

Table 4.6 displays the ratio of the design MSE of each estimator with weights determined by a primary variable of interest to that of the estimator with variable-specific weights, calculated separately for each of the four study variables for $n$ equal to 200 and 500. Using a single set of weights determined by a primary variable of interest results in a similar or slightly higher MSE than using variable-specific weights. Here, the loss in efficiency is modest, less than 8% in all settings considered. Similar results were obtained for the case $n = 1{,}000$. There is no clear pattern in terms of loss of efficiency and sample size.


Table 4.6
Ratio of MSE for each estimator with weights determined by primary variable of interest to MSE for estimator with variable-specific weights

| Primary variable | Estimator | Trade Credit, n = 200 | Trade Credit, n = 500 | Line of Credit, n = 200 | Line of Credit, n = 500 | Business Credit Card, n = 200 | Business Credit Card, n = 500 | Lease Financing, n = 200 | Lease Financing, n = 500 |
|---|---|---|---|---|---|---|---|---|---|
| Trade Credit | TREE | - | - | 1.01 | 0.97 | 0.99 | 1.00 | 0.99 | 1.00 |
| Trade Credit | CLASSO (1-way) | - | - | 0.99 | 0.99 | 1.01 | 0.99 | 1.00 | 1.01 |
| Trade Credit | CLASSO (2-way) | - | - | 0.93 | 0.94 | 0.92 | 0.98 | 0.92 | 0.97 |
| Trade Credit | CALASSO | - | - | 0.97 | 0.99 | 1.01 | 0.99 | 0.96 | 1.00 |
| Line of Credit | TREE | 1.06 | 0.97 | - | - | 0.98 | 1.00 | 0.98 | 0.97 |
| Line of Credit | CLASSO (1-way) | 0.96 | 0.98 | - | - | 0.99 | 1.01 | 0.99 | 0.99 |
| Line of Credit | CLASSO (2-way) | 0.95 | 0.96 | - | - | 0.92 | 0.98 | 0.93 | 0.96 |
| Line of Credit | CALASSO | 0.97 | 0.98 | - | - | 0.99 | 1.00 | 0.96 | 0.98 |
| Business Credit Card | TREE | 1.06 | 1.01 | 1.06 | 0.97 | - | - | 0.99 | 1.02 |
| Business Credit Card | CLASSO (1-way) | 0.99 | 1.02 | 0.98 | 0.97 | - | - | 0.99 | 1.02 |
| Business Credit Card | CLASSO (2-way) | 0.98 | 1.00 | 0.95 | 0.93 | - | - | 0.92 | 0.99 |
| Business Credit Card | CALASSO | 1.00 | 1.02 | 0.97 | 0.97 | - | - | 1.00 | 1.01 |
| Lease Financing | TREE | 1.07 | 1.03 | 1.06 | 1.05 | 0.99 | 1.02 | - | - |
| Lease Financing | CLASSO (1-way) | 0.99 | 1.05 | 0.98 | 1.04 | 0.99 | 1.02 | - | - |
| Lease Financing | CLASSO (2-way) | 0.97 | 1.02 | 0.96 | 1.01 | 0.92 | 0.99 | - | - |
| Lease Financing | CALASSO | 1.00 | 1.05 | 0.98 | 1.05 | 1.00 | 1.01 | - | - |
