Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 3. Simulation study using Financing and Growth of Small and Medium Enterprises Survey data
In this section, we describe a simulation study used to compare the performance of model-assisted survey regression estimators relative to the purely design-based HT estimator. Treating the Survey of Financing and Growth of Small and Medium Enterprises data as the population, we compare the estimators over repeated samples from these data. The quantity estimated is the total amount of trade credit requested, trade credit being one particular type of financing.
3.1 Simulation population
The Survey of Financing and Growth of Small and Medium Enterprises (SFGSME) is a periodic survey of enterprises that is conducted approximately every three years and collects information on the types of financing businesses use. The sample is stratified by size (defined by the number of employees), age of the business, industry at the 2-digit North American Industry Classification System (NAICS) level, and geography. A sample of approximately 17,000 enterprises was selected for the 2017 iteration of the survey.
The Business Register (BR) is the primary source of auxiliary information for business surveys at Statistics Canada. The frame used by the SFGSME was constructed by selecting from Statistics Canada's BR all enterprises with between 1 and 499 employees and a minimum gross revenue of $30,000. Non-profit enterprises, as well as enterprises belonging to certain industry subgroups, were excluded from the target population. The BR contains information on the location, number of employees, industry and revenue of each enterprise in the population.
3.2 Simulation methodology
We conducted a simulation study to compare the relative performance of several model-assisted survey regression estimators, using three and four categorical auxiliary variables. The 9,115 respondents in the SFGSME dataset were treated as the target population, and repeated samples of three different sizes were drawn from it using stratified simple random sampling, as this is the design commonly used by statistical agencies for business surveys. We assumed two strata: stratum A consists of units with revenue of less than $2.5 million and stratum B consists of units with revenue greater than $2.5 million. We assumed equal sample sizes in each stratum, although most of the units in the population, approximately 70%, belong to stratum A. Under this sampling design, larger revenue units are over-represented, resulting in an unequal probability sampling design. Preliminary simulations using a simple random sample design were also considered and yielded similar results. The minimum sample size was chosen because, for smaller sample sizes and 28 categories of auxiliary variables, there were often categories without a sampled unit. In such cases, it is not possible to calibrate the GREG estimator to all the pre-specified marginal totals.
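The two-stratum design described above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study (which was carried out in R): the toy population, the per-stratum sample size of 50, and the names `stratum_of`, `stratified_srs` and `ht_total` are all hypothetical. The sketch draws equal-sized simple random samples from the two revenue strata and computes the HT estimate of a total as the sum of sampled values divided by their inclusion probabilities.

```python
import random

THRESHOLD = 2.5e6  # revenue cut-off between stratum A and stratum B


def stratum_of(unit):
    """Assign a unit to stratum A (lower revenue) or B (higher revenue)."""
    return "A" if unit["revenue"] < THRESHOLD else "B"


def stratified_srs(population, n_per_stratum, rng):
    """Draw an equal-sized SRS without replacement from each stratum.

    Returns (unit, inclusion_probability) pairs, where the inclusion
    probability in stratum h is n_h / N_h.
    """
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for units in strata.values():
        pi = n_per_stratum / len(units)
        for unit in rng.sample(units, n_per_stratum):
            sample.append((unit, pi))
    return sample


def ht_total(sample):
    """Horvitz-Thompson estimator of a total: sum of y_i / pi_i."""
    return sum(unit["y"] / pi for unit, pi in sample)


# Hypothetical toy population (assumption): 70% of units fall in stratum A.
rng = random.Random(1)
population = []
for i in range(1000):
    if i < 700:
        revenue = rng.uniform(3e4, 2.4e6)
    else:
        revenue = rng.uniform(2.6e6, 2e7)
    y = max(0.0, 0.01 * revenue + rng.gauss(0, 5000))
    population.append({"revenue": revenue, "y": y})

sample = stratified_srs(population, 50, rng)
true_total = sum(unit["y"] for unit in population)
print("HT estimate:", ht_total(sample), "true total:", true_total)
```

Because the two strata receive the same sample size but differ in population size, units in stratum B have a larger inclusion probability (50/300) than units in stratum A (50/700), which is what makes this an unequal probability design.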
For each sample, models using three auxiliary variables, industry (10 categories), employment size (4 categories) and region (6 categories), were used to estimate the total amount of trade credit requested, and the results were compared to the true total. We also considered a fourth variable, revenue, with 8 categories. For each combination of the three sample sizes and the two sets of auxiliary variables, with 20 and 28 main-effects categories, we drew 5,000 repeated stratified random samples from the target population. For each sample, we implemented the HT estimator and several model-assisted survey estimators as summarized in Table 3.1 below:
| Estimator | Auxiliary data | Regression weights | Calibration totals |
|---|---|---|---|
| GREG | Marginal totals; main effects only | Independent of the study variable | All auxiliary variables |
| GREG with forward variable selection (FSTEP) | Individual values; main effects only | Dependent on the study variable | Selected auxiliary variables |
| Regression tree (TREE) | Individual values | Dependent on the study variable; strictly positive | Population size of each box |
| Lasso (LASSO) | Individual values; main effects (1-way) and two-way interactions (2-way) | | |
| Calibrated lasso (CLASSO) | Individual values; main effects (1-way) and two-way interactions (2-way) | Dependent on the study variable | Population size and lasso-fitted mean function |
| Adaptive lasso (ALASSO) | Individual values; main effects only | | |
| Calibrated adaptive lasso (CALASSO) | Individual values; main effects only | Dependent on the study variable | Population size and lasso-fitted mean function |
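To make the "Regression weights" and "Calibration totals" columns concrete, consider the simplest special case of a GREG estimator: a single categorical auxiliary variable, for which the GREG estimator reduces to the post-stratified estimator. The Python sketch below is an illustration under that simplifying assumption, not the multi-variable estimator used in the study; the function name `poststratified_weights` and the toy data are hypothetical. Each design weight is rescaled so that the weighted sample count in every category matches the known population count.

```python
def poststratified_weights(categories, design_weights, pop_counts):
    """Calibrate design weights to known population counts per category.

    categories     : category label of each sampled unit
    design_weights : design weight (1 / inclusion probability) of each unit
    pop_counts     : dict mapping category label -> known population count
    """
    # Weighted sample count per category (estimated category size).
    est_counts = {}
    for c, d in zip(categories, design_weights):
        est_counts[c] = est_counts.get(c, 0.0) + d

    # A category with no sampled unit cannot be calibrated.
    missing = [c for c in pop_counts if est_counts.get(c, 0.0) == 0.0]
    if missing:
        raise ValueError(f"no sampled unit in categories: {missing}")

    # Rescale each weight by (population count / estimated count).
    return [d * pop_counts[c] / est_counts[c]
            for c, d in zip(categories, design_weights)]


# Hypothetical toy sample (assumption): two categories, five units.
categories = ["a", "a", "b", "b", "b"]
design_weights = [10.0, 10.0, 20.0, 20.0, 20.0]
pop_counts = {"a": 30, "b": 50}
y = [1.0, 2.0, 3.0, 4.0, 5.0]

weights = poststratified_weights(categories, design_weights, pop_counts)
estimate = sum(w * yi for w, yi in zip(weights, y))
print("calibrated weights:", weights)
print("estimated total:", estimate)
```

The weights are computed without reference to `y`, so the same weights can be reused for any study variable. The `ValueError` branch corresponds to the situation in which a category contains no sampled unit and calibration to all marginal totals is not possible.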
We initially also considered adaptive lasso and calibrated adaptive lasso estimators using all main effects and 2-way interactions, but the coefficient estimates under the GREG linear model were highly unstable, leading to singularity issues.
All computations were completed in R (Version 3.4.0, 2017). The HT, GREG, regression tree and lasso estimators were calculated using the package mase (McConville, Tang, Zhu, Li, Cheung and Toth, 2018) and the adaptive lasso coefficients were computed using the package glmnet (Friedman, Hastie, Simon, Qian and Tibshirani, 2017). The function cv.glmnet was used to select the value of the penalty parameter for the lasso estimators. We used a 10-fold cross-validation procedure, which allows for the inclusion of design weights. For the regression tree estimator, the minimum box size was specified as 25 and the level of significance was 0.05. We also considered a minimum box size of 10 units. For small sample sizes, this gave a small gain in efficiency relative to a minimum box size of 25. For larger sample sizes, different choices of the minimum box size yielded similar results in terms of mean square error. Forward stepwise selection for the FSTEP estimator was based on minimizing the Akaike Information Criterion (AIC) and was performed using the function stepAIC in the MASS package (Ripley, Venables, Bates, Hornik, Gebhardt and Firth, 2017).
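The repeated-sampling comparison in terms of mean square error can be sketched as follows. This Python sketch is a deliberately simplified stand-in for the study's R code and makes several assumptions that are not from the source: a synthetic population with one four-level categorical auxiliary variable, simple random sampling rather than the stratified design, 500 replicates rather than 5,000, and a post-stratified estimator standing in for the model-assisted estimators. The function name `simulate` is hypothetical.

```python
import random


def simulate(n_reps=500, n=100, seed=42):
    """Empirical MSE of the HT and post-stratified estimators of a total."""
    rng = random.Random(seed)

    # Hypothetical population (assumption): 4 equal-sized categories whose
    # mean y differs strongly, so the auxiliary variable is informative.
    means = [10.0, 50.0, 100.0, 500.0]
    pop = [(i % 4, rng.gauss(means[i % 4], 5.0)) for i in range(2000)]
    N = len(pop)
    true_total = sum(y for _, y in pop)
    pop_counts = [N // 4] * 4  # known category sizes (500 each)

    sq_err_ht = 0.0
    sq_err_ps = 0.0
    used = 0
    for _ in range(n_reps):
        sample = rng.sample(pop, n)  # simple random sample without replacement

        # HT estimator under SRS: every design weight equals N / n.
        ht = (N / n) * sum(y for _, y in sample)

        # Post-stratified estimator: known category size times sample mean.
        n_g = [0] * 4
        sum_g = [0.0] * 4
        for g, y in sample:
            n_g[g] += 1
            sum_g[g] += y
        if min(n_g) == 0:
            continue  # a category has no sampled unit; skip this replicate
        ps = sum(pop_counts[g] * sum_g[g] / n_g[g] for g in range(4))

        sq_err_ht += (ht - true_total) ** 2
        sq_err_ps += (ps - true_total) ** 2
        used += 1

    return sq_err_ht / used, sq_err_ps / used


mse_ht, mse_ps = simulate()
print("MSE ratio (post-stratified / HT):", mse_ps / mse_ht)
```

A ratio below 1 indicates that the estimator using auxiliary information is more efficient than the HT estimator for this synthetic population. The size of the gain here reflects how strongly the toy study variable depends on the category and says nothing about the results obtained with the SFGSME data.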
In regressing the amount of trade credit requested for the entire finite population on the 28 marginal categories, the adjusted coefficient of determination was approximately when both main effects and two-way interaction effects were considered. For the population model with main effects only, the number of significant effects was 15; for the population model with main effects and two-way interactions, there were 2 significant main effects and 29 significant interaction effects. These population-level results indicate that useful predictive models should be sparse and that there may be important two-way interactions.
Fitting regression tree models to the amount of trade credit requested resulted in 25 splits. The first split was based on revenue, indicating that revenue is the auxiliary variable most strongly related to the amount of trade credit requested. There were splits based on all four of the auxiliary variables considered: revenue, industry, employment size and geography. This is consistent with the conclusion that useful predictive models should be sparse but allow for higher-order interactions.