Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 3. Simulation study using Financing and Growth of Small and Medium Enterprises Survey data

In this section, we describe a simulation study used to compare the performance of model-assisted survey regression estimators relative to the purely design-based HT estimator. Treating the Survey of Financing and Growth of Small and Medium Enterprises data as the population, we draw repeated samples from it and compare the resulting estimates of the total amount of trade credit requested, trade credit being a particular type of financing.

3.1   Simulation population

The Survey of Financing and Growth of Small and Medium Enterprises (SFGSME) is a periodic survey of enterprises, conducted approximately every three years, that collects information on the types of financing businesses use. The sample is stratified by size (defined by the number of employees), age of the business, industry at the 2-digit North American Industry Classification System (NAICS) level, and geography. A sample of approximately 17,000 enterprises was selected for the 2017 iteration of the survey.

The Business Register (BR) is the primary source of auxiliary information for business surveys at Statistics Canada. The frame used by the SFGSME was constructed by selecting from Statistics Canada's BR all enterprises with between 1 and 499 employees and a minimum gross revenue of $30,000. Non-profit enterprises, as well as enterprises belonging to certain industry subgroups, were excluded from the target population. The BR contains information on the location, number of employees, industry and revenue of each enterprise in the population.

3.2   Simulation methodology

We conducted a simulation study to compare the relative performance of several model-assisted survey regression estimators, using three and four categorical auxiliary variables. We considered sample sizes of \(n \in \{200,\ 500,\ 1{,}000\}\) drawn from the 9,115 respondents in the SFGSME dataset. This dataset was treated as the target population and repeated samples were drawn using stratified simple random sampling, as this is the design commonly used by statistical agencies for business surveys. We assumed there are two strata, where stratum A consists of units with revenue of less than $2.5 million and stratum B consists of units with revenue greater than $2.5 million. We assumed equal sample sizes in each stratum, but most of the units in the population, approximately 70%, belong to stratum A. Under this sampling design, larger revenue units are over-represented, resulting in an unequal probability sampling design. Preliminary simulations using a simple random sample design were also considered and yielded similar results.
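The stratified design described above can be illustrated with a short Python sketch (the study itself was run in R). This is a minimal sketch under stated assumptions: the population is synthetic, and the function names, stratum sizes (700 and 300, mimicking the roughly 70/30 split), sample size and value ranges are all hypothetical. It shows how equal allocation to strata of unequal size produces unequal design weights \(N_h / n_h\), and how the HT estimator of a total uses them.

```python
import random


def draw_stratified_srs(strata, n_per_stratum, rng):
    """Draw a simple random sample without replacement from each stratum.

    Returns a list of (y, design_weight) pairs, where the design weight is
    N_h / n_h, the inverse of the inclusion probability in stratum h.
    """
    sample = []
    for units in strata.values():
        weight = len(units) / n_per_stratum
        for y in rng.sample(units, n_per_stratum):
            sample.append((y, weight))
    return sample


def ht_total(sample):
    """Horvitz-Thompson estimator of the population total."""
    return sum(weight * y for y, weight in sample)


rng = random.Random(2017)

# Hypothetical population: stratum A holds about 70% of the units (lower
# revenue), stratum B the remaining 30% (higher revenue, larger y values).
population = {
    "A": [rng.uniform(0, 50) for _ in range(700)],
    "B": [rng.uniform(0, 500) for _ in range(300)],
}
true_total = sum(sum(units) for units in population.values())

# Equal allocation: the same number of units is drawn from each stratum.
sample = draw_stratified_srs(population, n_per_stratum=100, rng=rng)
weights = sorted({weight for _, weight in sample})

print("distinct design weights:", weights)
print("HT estimate:", round(ht_total(sample), 1))
print("true total: ", round(true_total, 1))
```

Units in the smaller stratum B receive the smaller weight because they are sampled at a higher rate, which is the over-representation of larger revenue units noted above.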
The minimum sample size considered was \(n = 200\) because, for smaller sample sizes and 28 categories of \(x\)-variables, there were often categories without a sampled unit. In this case, it is not possible to calibrate the GREG estimator to all the pre-specified marginal totals.
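To give a rough sense of why small samples leave categories empty, the sketch below computes the probability that a category of a given size contributes no unit to a sample. It is only an illustration under simplifying assumptions: it uses simple random sampling without replacement rather than the stratified design, and the category sizes are hypothetical, not taken from the SFGSME data. Only the population size of 9,115 comes from the text.

```python
from math import comb


def prob_category_empty(N, M, n):
    """Probability that a simple random sample of n units, drawn without
    replacement from N units, contains none of the M units in a category
    (the hypergeometric probability of zero successes)."""
    return comb(N - M, n) / comb(N, n)


N = 9115  # population size used in the simulation study
for M in (25, 50, 100):  # hypothetical category sizes
    row = ", ".join(
        f"n={n}: {prob_category_empty(N, M, n):.3f}" for n in (50, 100, 200, 500)
    )
    print(f"category of {M} units -> {row}")
```

With 28 categories, even a modest chance of an empty category for each sparse category accumulates quickly across categories and replicates.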

For each sample, models using three \(x\)-variables, industry (10 categories), employment size (4 categories) and region (6 categories), were used to estimate the total amount of trade credit requested, and results were compared to the true total. We also considered a fourth variable, revenue, with 8 categories. For each combination of the three sample sizes and the two sets of auxiliary variables, with 20 and 28 main effects categories, we drew 5,000 repeated stratified random samples from the target population. For each sample, we implemented the HT estimator and several model-assisted survey estimators as summarized in Table 3.1 below:


Table 3.1
Summary of model-assisted estimators considered in simulation study
| Estimator | Auxiliary Data | Regression Weights | Calibration Totals |
|---|---|---|---|
| GREG | Marginal totals. Considered main effects only | Independent of \(y\) | All auxiliary variables |
| GREG with forward variable selection (FSTEP) | Individual values. Considered main effects only | Dependent on \(y\) | Selected auxiliary variables |
| Regression Tree (TREE) | Individual values | Dependent on \(y\), strictly positive | Population size of each box |
| Lasso (LASSO) | Individual values. Considered main effects (1-way) and two-way interactions (2-way) | | |
| Calibrated lasso (CLASSO) | Individual values. Considered main effects (1-way) and two-way interactions (2-way) | Dependent on \(y\) | Population size and lasso-fitted mean function |
| Adaptive lasso (ALASSO) | Individual values. Considered main effects only | | |
| Calibrated adaptive lasso (CALASSO) | Individual values. Considered main effects only | Dependent on \(y\) | Population size and lasso-fitted mean function |

We initially also considered adaptive lasso and adaptive lasso calibration estimators using all main effects and 2-way interactions, but estimates of the coefficients under the GREG linear model, \(\hat{\boldsymbol{\beta}}_s\), were highly unstable, leading to singularity issues.
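The repeated-sampling comparison can be sketched as follows. This is a simplified illustration in Python, not the study's R code: the population is synthetic, all names and parameter values are hypothetical, and a post-stratified estimator stands in for the model-assisted estimators. Post-stratification is used because it is the form a GREG estimator takes when the only auxiliary variable is a single categorical variable, and it calibrates to the population size of each category in the same way the tree estimator calibrates to the population size of each box. The replicate count is reduced to 2,000 to keep the sketch fast.

```python
import random
from collections import defaultdict


def make_population(rng, size=1000, n_stratum_a=700):
    """Hypothetical population in which y depends strongly on a category."""
    group_means = {"small": 10.0, "medium": 50.0, "large": 200.0}
    population = []
    for i in range(size):
        group = rng.choice(sorted(group_means))
        stratum = "A" if i < n_stratum_a else "B"
        shift = 30.0 if stratum == "B" else 0.0
        y = group_means[group] + shift + rng.gauss(0.0, 10.0)
        population.append({"stratum": stratum, "group": group, "y": y})
    return population


def draw_sample(strata, n_per_stratum, rng):
    """Stratified SRS with equal allocation; returns (unit, weight) pairs."""
    sample = []
    for units in strata.values():
        weight = len(units) / n_per_stratum
        sample.extend((u, weight) for u in rng.sample(units, n_per_stratum))
    return sample


def ht_total(sample):
    return sum(w * u["y"] for u, w in sample)


def poststratified_total(sample, group_sizes):
    """Sum over categories of N_g times the design-weighted sample mean."""
    weighted_y, weighted_n = defaultdict(float), defaultdict(float)
    for u, w in sample:
        weighted_y[u["group"]] += w * u["y"]
        weighted_n[u["group"]] += w
    missing = set(group_sizes) - set(weighted_n)
    if missing:  # calibration is impossible if a category has no sampled unit
        raise ValueError(f"no sampled unit in categories: {sorted(missing)}")
    return sum(N_g * weighted_y[g] / weighted_n[g] for g, N_g in group_sizes.items())


def simulate(population, n_per_stratum, replicates, rng):
    """Monte Carlo relative bias and mean square error of both estimators."""
    true_total = sum(u["y"] for u in population)
    strata, group_sizes = defaultdict(list), defaultdict(int)
    for u in population:
        strata[u["stratum"]].append(u)
        group_sizes[u["group"]] += 1
    errors = {"HT": [], "PS": []}
    for _ in range(replicates):
        sample = draw_sample(strata, n_per_stratum, rng)
        errors["HT"].append(ht_total(sample) - true_total)
        errors["PS"].append(poststratified_total(sample, group_sizes) - true_total)
    return {
        name: {
            "relative_bias": sum(errs) / len(errs) / true_total,
            "mse": sum(e * e for e in errs) / len(errs),
        }
        for name, errs in errors.items()
    }


rng = random.Random(42)
population = make_population(rng)
summary = simulate(population, n_per_stratum=50, replicates=2000, rng=rng)
mse_ratio = summary["PS"]["mse"] / summary["HT"]["mse"]

for name, stats in summary.items():
    print(name, f"relative bias = {stats['relative_bias']:.4f}", f"MSE = {stats['mse']:.1f}")
print(f"MSE ratio (PS / HT) = {mse_ratio:.3f}")
```

An MSE ratio below one indicates a gain in efficiency over the HT estimator; in this synthetic population the gain is large only because \(y\) was constructed to depend strongly on the category.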

All computations were completed in R (Version 3.4.0, 2017). The HT, GREG, regression tree and lasso estimators were calculated using the package mase (McConville, Tang, Zhu, Li, Cheung and Toth, 2018) and the adaptive lasso coefficients were computed using the package glmnet (Friedman, Hastie, Simon, Qian and Tibshirani, 2017). The function cv.glmnet was used to select the value of the penalty parameter for the lasso estimators. We used a 10-fold cross-validation procedure, which allows for the inclusion of design weights. For the regression tree estimator, the minimum box size \(k(n)\) was specified as 25 and the level of significance \(\alpha\) was 0.05. We also considered a minimum box size of 10 units. For small sample sizes, there was a small gain in efficiency relative to a minimum box size of 25. For sample sizes of \(n = 1{,}000\), different choices for the minimum box size yielded similar results in terms of mean square error.
Forward stepwise selection for the FSTEP estimator was based on minimizing the Akaike Information Criterion (AIC) and was performed using the function stepAIC in the MASS package (Ripley, Venables, Bates, Hornik, Gebhardt and Firth, 2017).
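For reference, the criterion minimized in forward selection can be written as follows for a Gaussian linear model. This is the standard form of the AIC up to an additive constant, not a formula taken from the paper; the symbols \(p\) (number of estimated coefficients) and \(\mathrm{RSS}_p\) (residual sum of squares of the candidate model) are introduced here for illustration, and how the design weights entered the fit is not specified in the text.

```latex
\mathrm{AIC}_p \;=\; n \,\log\!\left(\frac{\mathrm{RSS}_p}{n}\right) + 2p ,
\qquad
\mathrm{RSS}_p \;=\; \sum_{i \in s} \left( y_i - \mathbf{x}_{i,p}^{\top} \hat{\boldsymbol{\beta}}_p \right)^2 .
```

Starting from the intercept-only model, each step adds the auxiliary variable that lowers \(\mathrm{AIC}_p\) the most, and the procedure stops when no addition lowers it further. The penalty \(2p\) is what keeps the selected model from growing whenever the reduction in residual sum of squares is small.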

In regressing the amount of trade credit requested for the entire finite population on the 28 marginal categories, the adjusted coefficient of determination was approximately \(R^2 = 0.22\) when both main effects and two-way interaction effects were considered. For the population model with main effects only, the number of significant effects was 15, and for the population model with main effects and two-way interactions, there were 2 significant main effects and 29 significant interaction effects. These population-level results indicate that useful predictive models should be sparse and that there may be important two-way interactions.

Fitting regression tree models to the amount of trade credit requested resulted in 25 splits. The first split was based on revenue, indicating that this is the auxiliary variable most strongly related to the amount of trade credit requested. There were splits based on all four of the auxiliary variables considered: revenue, industry, employment size and geography. This is consistent with the conclusion that useful predictive models should be sparse but allow for higher-order interactions.

