A method to find an efficient and robust sampling strategy under model uncertainty
Section 6. Numerical examples

In Sections 2 and 3 we have established that the strategy $π ps (δ_{2}) - diff (δ_{1})$ is optimal under a superpopulation model, but it is not robust to misspecifications of this model. In Subsection 6.1 we present a small Monte Carlo simulation study carried out to illustrate these results by comparing the optimal strategy and three alternatives.

In Sections 4 and 5 we introduced a measure that quantifies the risk of implementing a sampling design and can therefore guide the choice of design. In Subsection 6.2 we illustrate the use of the risk measure with real survey data.

6.1 Simulation study under a misspecified model

We compare the efficiency and robustness of four strategies through a simulation study. The strategies to be compared are $π ps$ together with the difference estimator (which is optimal when the model is correct), $π ps$ together with the GREG estimator (optimal design), stratified simple random sampling (STSI) together with the difference estimator (optimal estimator) and STSI together with the GREG estimator.

Our implementation of $π ps$ makes use of Pareto $π ps$ (Rosén, 1997). There is a host of other schemes for drawing $π ps$ samples, but Pareto $π ps$ is a convenient method with good properties; see, for example, Rosén (2000).
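To make the sampling scheme concrete, the following is a minimal sketch of Pareto $π ps$ as it is usually described: each unit gets a target inclusion probability proportional to its size measure, a ranking variable is computed from a uniform random number, and the $n$ units with the smallest ranking variables form the sample. The function name `pareto_pips_sample` and its arguments are illustrative assumptions, not taken from the paper; in the setting of this section the size measure would be $g (x_{k} | δ_{2}) .$

```python
import random


def pareto_pips_sample(sizes, n, rng=None):
    """Draw a fixed-size Pareto pips sample (illustrative sketch).

    sizes : positive size measures, one per population unit
    n     : sample size
    Returns (sorted sample indices, target inclusion probabilities).
    """
    rng = rng or random.Random()
    total = sum(sizes)
    # Target inclusion probabilities, proportional to size.
    lam = [n * s / total for s in sizes]
    if max(lam) >= 1.0:
        raise ValueError("a target inclusion probability is >= 1; "
                         "such units need separate treatment")
    ranked = []
    for k, l in enumerate(lam):
        u = rng.random()  # uniform on [0, 1)
        # Ranking variable: odds of u divided by odds of the target probability.
        q = (u / (1.0 - u)) / (l / (1.0 - l))
        ranked.append((q, k))
    ranked.sort()
    # The n units with the smallest ranking variables are selected.
    sample = sorted(k for _, k in ranked[:n])
    return sample, lam
```

The sample size is fixed by construction, and the realized inclusion probabilities are only approximately equal to the targets, which is the usual caveat of this scheme.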

Our implementation of STSI makes use of model-based stratification (Wright, 1983). We consider $H =$ 5 strata with boundaries defined by applying the $cum \sqrt{f}$ -rule of Dalenius and Hodges (1959) to $g (x_{k} | δ_{2}),$ as described in Särndal et al. (1992, page 463), and the sample is allocated using Neyman allocation, $n_{h} \propto N_{h} S_{g h}$ . Using the $cum \sqrt{f}$ -rule may be suboptimal (see Särndal et al., 1992, page 464), but the efficiency of stratification by a continuous size variable is fairly insensitive to the exact choice of boundaries.
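The stratification step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the number of histogram bins (`n_bins`), the function names and the decision to return an unrounded allocation are all choices made here. The input `values` stands for the stratification variable, which in this section is $g (x_{k} | δ_{2}) .$

```python
import math


def cum_sqrt_f_boundaries(values, n_strata, n_bins=100):
    """Stratum boundaries from the cum sqrt(f) rule on an equal-width histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        j = min(int((v - lo) / width), n_bins - 1)
        counts[j] += 1
    # Cumulate the square roots of the bin frequencies.
    cum, running = [], 0.0
    for c in counts:
        running += math.sqrt(c)
        cum.append(running)
    # Cut the cumulated scale into n_strata equal intervals.
    boundaries = []
    for h in range(1, n_strata):
        target = running * h / n_strata
        j = next(i for i, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (j + 1) * width)
    return boundaries


def stratify(values, boundaries):
    """Stratum index (0-based) of each value: number of boundaries below it."""
    return [sum(v > b for b in boundaries) for v in values]


def neyman_allocation(values, strata, n, n_strata):
    """Unrounded Neyman allocation, n_h proportional to N_h * S_h."""
    weights = []
    for h in range(n_strata):
        g = [v for v, s in zip(values, strata) if s == h]
        if len(g) < 2:
            weights.append(0.0)
            continue
        mean = sum(g) / len(g)
        s_h = math.sqrt(sum((v - mean) ** 2 for v in g) / (len(g) - 1))
        weights.append(len(g) * s_h)
    total = sum(weights)
    return [n * w / total for w in weights]
```

Rounding the allocation to integers and handling strata where $n_{h}$ exceeds $N_{h}$ are left out of the sketch.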

We consider only misspecification of the spread. The trend term is of the form $f (x_{k} | β_{1}) = β_{10} + β_{11} x_{k}^{β_{12}}$ with $β_{10} =$ 1,000, $β_{11} =$ 1 and $β_{12} =$ 0.75, 1 and 1.25. The true spread is $g (x_{k} | β_{2}) = x_{k}^{β_{2}}$ with $β_{2} =$ 0.5, 0.75 and 1. The working spread is $g (x_{k} | δ_{2}) = x_{k}^{δ_{2}}$ with $δ_{2} =$ 0.5, 0.75 and 1.

We will use the difference estimator (2.1) calibrated on $f (x_{k} | β_{1}) .$ Regarding the GREG estimator, we will fix $β_{12},$ whereas the coefficients $β_{10}$ and $β_{11}$ will be estimated.
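As an illustration of the two estimators, the sketch below assumes that (2.1) is the usual difference estimator, $\sum_{U} f (x_{k}) + \sum_{s} (y_{k} - f (x_{k})) / π_{k},$ and that the GREG estimator fits an intercept and a slope on $z_{k} = x_{k}^{β_{12}}$ by least squares weighted with $1 / π_{k} .$ The weighting is an assumption made here; the paper may additionally weight by the working spread. All function and argument names are hypothetical.

```python
def difference_estimator(f_pop, y_s, f_s, pi_s):
    """Total of the fixed trend over U plus pi-weighted sample residuals."""
    return sum(f_pop) + sum((y - f) / p for y, f, p in zip(y_s, f_s, pi_s))


def greg_estimator(z_pop, y_s, z_s, pi_s):
    """GREG total with estimated intercept and slope on z = x ** beta_12.

    z_pop : regressor for every population unit
    y_s, z_s, pi_s : study variable, regressor and inclusion
                     probabilities of the sampled units
    """
    w = [1.0 / p for p in pi_s]  # design weights (assumed fitting weights)
    sw = sum(w)
    z_bar = sum(wi * z for wi, z in zip(w, z_s)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, y_s)) / sw
    s_zy = sum(wi * (z - z_bar) * (y - y_bar) for wi, z, y in zip(w, z_s, y_s))
    s_zz = sum(wi * (z - z_bar) ** 2 for wi, z in zip(w, z_s))
    b1 = s_zy / s_zz
    b0 = y_bar - b1 * z_bar
    # Predicted total over U plus pi-weighted sample residuals.
    fitted_total = sum(b0 + b1 * z for z in z_pop)
    residual_total = sum((y - b0 - b1 * z) / p
                         for y, z, p in zip(y_s, z_s, pi_s))
    return fitted_total + residual_total
```

The only difference between the two functions is whether the trend coefficients are supplied (difference estimator) or estimated from the sample (GREG), which is the contrast the simulation is designed to measure.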

The simulation is set out as follows. The population size is $N =$ 5,000. The $x$ -values are independent realizations from a gamma distribution with shape $α = 4 / 100$ and scale $λ =$ 1,200 plus one unit, whereas $y_{k}$ is a realization from a gamma distribution with shape and scale

$α_{k} = \frac{{(β_{10} + β_{11} x_{k}^{β_{12}})}^{2}}{σ_{0}^{2} x_{k}^{2 β_{2}}} \quad \text{and} \quad λ_{k} = \frac{σ_{0}^{2} x_{k}^{2 β_{2}}}{β_{10} + β_{11} x_{k}^{β_{12}}},$

where $σ_{0}^{2}$ was set in such a way that the correlation between $x$ and $y$ is $ρ =$ 0.95. The design MSE of a sample of size $n =$ 500 is then computed for each strategy. Holding the $x$ -values fixed, the process is iterated $B =$ 5,000 times.
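The data-generating step can be sketched with the Python standard library, whose `random.gammavariate(alpha, beta)` takes a shape and a scale. With the shape and scale given above, $y_{k}$ has mean $β_{10} + β_{11} x_{k}^{β_{12}}$ and variance $σ_{0}^{2} x_{k}^{2 β_{2}},$ which is what the first function encodes. The function names are assumptions, and the tuning of $σ_{0}^{2}$ to reach $ρ =$ 0.95 is not shown because the text does not describe how it was done; `sigma0_sq` is simply passed in.

```python
import random


def gamma_parameters(x, b10, b11, b12, b2, sigma0_sq):
    """Shape and scale of the gamma distribution of y given x."""
    trend = b10 + b11 * x ** b12             # model mean of y
    spread_sq = sigma0_sq * x ** (2.0 * b2)  # model variance of y
    shape = trend ** 2 / spread_sq
    scale = spread_sq / trend
    return shape, scale


def generate_population(N, b10, b11, b12, b2, sigma0_sq, rng):
    """Generate (x, y) values for a population of size N."""
    # x: gamma with shape 4/100 and scale 1,200, plus one unit.
    xs = [rng.gammavariate(4.0 / 100.0, 1200.0) + 1.0 for _ in range(N)]
    ys = []
    for x in xs:
        shape, scale = gamma_parameters(x, b10, b11, b12, b2, sigma0_sq)
        ys.append(rng.gammavariate(shape, scale))
    return xs, ys
```

In a replication of the study the $x$ -values would be generated once and only the $y$ -values redrawn in each of the $B$ iterations.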

Table 6.1 shows the results of the simulation study. The first three columns indicate the model parameters. The fourth column shows the (simulated) model expected MSE of the strategy $π ps$ $-$ dif, whereas the last three columns show the (simulated) efficiency of the strategies $π ps$ $-$ GREG, STSI $-$ dif and STSI $-$ GREG compared to $π ps$ $-$ dif (as a percentage), with efficiency defined as $eff = {MSE}_{ξ, π ps} ({\hat{t}}_{y}) / {MSE}_{ξ, p} ({\hat{t}}_{y})$ where the model expected MSEs are approximated by their simulated counterparts,

${MSE}_{ξ, p} ({\hat{t}}_{y}) = E_{ξ} {MSE}_{p} ({\hat{t}}_{y}) \approx \frac{1}{B} \sum_{r =1}^{B} {MSE}_{p}^{(r)} ({\hat{t}}_{y}),$

in such a way that a value of 100 indicates that the strategy is as efficient as $π ps$ $-$ dif and values smaller (larger) than 100 indicate that the strategy is less (more) efficient than $π ps$ $-$ dif.
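The direction of the ratio is easy to get wrong, so here is a small sketch of the efficiency computation. It assumes that the design MSE of each strategy has already been computed in every replicate; the function names are illustrative.

```python
def simulated_expected_mse(mse_per_replicate):
    """Average of the design MSEs over the B replicates."""
    return sum(mse_per_replicate) / len(mse_per_replicate)


def efficiency_percent(mse_reference, mse_alternative):
    """Efficiency of an alternative strategy relative to the reference.

    mse_reference   : design MSEs of pips - dif, one per replicate
    mse_alternative : design MSEs of the alternative strategy
    Values below 100 mean the alternative is less efficient.
    """
    return (100.0 * simulated_expected_mse(mse_reference)
            / simulated_expected_mse(mse_alternative))
```

For instance, an alternative whose simulated MSE is twice that of the reference has an efficiency of 50.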

The upper part of Table 6.1 shows the case when the working model coincides with the true model. As expected, the strategy that couples $π ps$ with the difference estimator ($π ps$ $-$ dif) was always more efficient than the remaining strategies. Nevertheless, the loss in efficiency due to estimating some parameters through the GREG estimator is negligible (the efficiency of $π ps$ $-$ GREG never falls below 99.0). On the other hand, there is a substantial loss in efficiency due to the use of STSI instead of $π ps$ (efficiencies between 57.3 and 83.2). Finally, note from (2.6) that, since the anticipated MSE of every strategy depends not on the trend $f$ but only on the spread $g,$ the efficiency remains constant for a given value of $δ_{2},$ regardless of the value of $β_{12} .$

Table 6.1
Efficiency of three strategies as a percentage of the model expected MSE of $π ps$ – dif
	$β_{12}$	$β_{2}$	$δ_{2}$	$π ps$ – dif (MSE)	$π ps$ – GREG (%)	STSI – dif (%)	STSI – GREG (%)
Correct model	0.75	0.50	0.50	2.78 · 10⁵	99.9	57.3	57.3
	0.75	0.75	0.75	4.82 · 10⁴	99.6	77.9	77.9
	0.75	1.00	1.00	1.90 · 10⁴	99.0	83.2	83.2
	1.00	0.50	0.50	7.64 · 10⁶	99.9	57.3	57.3
	1.00	0.75	0.75	7.20 · 10⁵	99.7	77.9	77.9
	1.00	1.00	1.00	2.14 · 10⁵	99.1	83.1	83.1
	1.25	0.50	0.50	1.46 · 10⁸	99.9	57.3	57.3
	1.25	0.75	0.75	7.85 · 10⁶	99.7	77.9	78.0
	1.25	1.00	1.00	1.81 · 10⁶	99.2	83.1	83.1
Misspecified model	0.75	0.50	0.75	3.98 · 10⁵	99.9	98.9	98.9
	0.75	0.75	1.00	6.45 · 10⁴	99.5	114.5	114.4
	0.75	1.00	0.50	4.73 · 10⁴	100.1	133.9	134.0
	1.00	0.50	1.00	2.14 · 10⁷	99.9	185.6	185.6
	1.00	0.75	0.50	1.03 · 10⁶	100.1	93.1	93.2
	1.00	1.00	0.75	2.77 · 10⁵	99.8	88.9	89.0
	1.25	0.50	0.75	2.09 · 10⁸	99.9	98.9	98.9
	1.25	0.75	1.00	1.05 · 10⁷	99.6	114.5	114.5
	1.25	1.00	0.50	4.50 · 10⁶	100.3	134.0	134.2

The lower part of Table 6.1 shows some comparisons under a misspecified model, in particular, a misspecified spread. Even under this mild misspecification of the model, $π ps$ $-$ dif is not necessarily the best strategy anymore: the strategies using STSI were more efficient in five of the nine cases. However, it is not evident when STSI will be more efficient than $π ps$ or vice versa. The risk measure introduced in Section 4 can be used to guide the choice between designs. The results shown in this section agree with those of, for example, Holmberg and Swensson (2001).

6.2 Using the risk measure for choosing the design in a real survey

In this subsection we illustrate the implementation of the risk measure using data from a real survey. We want to estimate $t_{y} = \sum_{U} y_{k}$ where $U$ is the set of residential properties in Bogotá, Colombia (of size $N =$ 681,276) and $y_{k}$ is the value of the $k^{th}$ property in 2017 in Colombian pesos (COP). The built-up area of the $k^{th}$ property in square meters, $x_{k},$ is known for every $k \in U .$ The auxiliary variable $x$ has mean 184, standard deviation 110 and skewness 2.57. The desired sample size is $n =$ 1,000.

We assume that a model of the type $ξ_{0}$ with $f (x_{k} | δ_{1}) = δ_{10} + δ_{11} x_{k}^{δ_{12}}$ and $g (x_{k} | δ_{2}) = x_{k}^{δ_{2}}$ adequately describes the association between $x$ and $y .$ We plan to use the GREG estimator for estimating $δ_{10}$ and $δ_{11},$ i.e., $δ_{1}^{* *} = (δ_{10}, δ_{11}) .$ As this model has the form shown in Example 4, the model expected MSE can be approximated by expression (5.7).

We will use the risk (4.1) to assist the choice between $π ps$ and STSI with $H = 6$ strata. We take $h (β_{12}, β_{2})$ to be a bivariate normal distribution with no correlation between $β_{12}$ and $β_{2} .$ The integral is approximated using the package cubature (Narasimhan, Johnson, Hahn, Bouvier and Kiêu, 2019) developed for the statistical software environment R (R Core Team, 2020).
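The numerical integration can be illustrated without the R package. The sketch below assumes that the risk has the form of an integral of a design's model expected MSE, viewed as a function of $(β_{12}, β_{2}),$ weighted by the density $h .$ It replaces the adaptive routines of cubature with a plain midpoint rule on a grid covering six standard deviations on each side of the mean, which is a simplification chosen here and not the method used in the paper. The callable `expected_mse` is a placeholder for an expression such as (5.7), which is not reproduced.

```python
import math


def normal_pdf(v, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def risk(expected_mse, mean, sd, n_grid=200, half_width=6.0):
    """Midpoint-rule approximation of the integral of expected_mse * h.

    expected_mse : callable (b12, b2) -> model expected MSE (placeholder)
    mean, sd     : pairs with the means and standard deviations of the
                   two uncorrelated normal components of h
    """
    (m1, m2), (s1, s2) = mean, sd
    h1 = 2.0 * half_width * s1 / n_grid  # grid step for beta_12
    h2 = 2.0 * half_width * s2 / n_grid  # grid step for beta_2
    total = 0.0
    for i in range(n_grid):
        b12 = m1 - half_width * s1 + (i + 0.5) * h1
        w1 = normal_pdf(b12, m1, s1) * h1
        for j in range(n_grid):
            b2 = m2 - half_width * s2 + (j + 0.5) * h2
            w2 = normal_pdf(b2, m2, s2) * h2
            total += expected_mse(b12, b2) * w1 * w2
    return total
```

Calling `risk` once per design with the same `mean` and `sd` gives two numbers that can be compared directly, which is how the risk is used in the two cases that follow.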

We consider two cases with different degrees of confidence regarding the working model.

Case 1. In this case no information about $δ_{12},$ $δ_{2}$ or $R_{x, y}$ is available. Naive values of $δ_{12} =$ 1, $δ_{2} =$ 1 and $R_{x, y} =$ 0.75 are considered. In order to reflect the uncertainty, $h (β)$ should have a large variance, therefore we set

$\begin{bmatrix} β_{12} \\ β_{2} \end{bmatrix} \sim N \left( \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix}, \begin{bmatrix} 0.3295^{2} & 0 \\ 0 & 0.3295^{2} \end{bmatrix} \right) .$

The variance was chosen in such a way that 99% of the mass lies in the circle of radius 1. Evaluation of (4.1) yields $R (π ps) = 6.89 \cdot 10^{15} β_{11}^{2}$ and $R ( STSI ) = 1.59 \cdot 10^{15} β_{11}^{2},$ suggesting that a stratified design should be used.
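The standard deviation can be recovered in closed form. For two uncorrelated normal components with common standard deviation $σ,$ the squared distance from the mean divided by $σ^{2}$ follows a chi-square distribution with two degrees of freedom, so the probability of falling within radius $r$ is $1 - \exp (- r^{2} / (2 σ^{2})) .$ Solving for $σ$ gives the short function below, whose name is an illustrative choice.

```python
import math


def sd_for_circle_mass(radius, mass=0.99):
    """Common standard deviation such that a circular bivariate normal
    places the given mass inside a circle of the given radius."""
    return radius / math.sqrt(-2.0 * math.log(1.0 - mass))
```

With radius 1 and mass 0.99 this returns approximately 0.3295, the value used in the covariance matrix above; because the result is proportional to the radius, a smaller circle yields a proportionally smaller standard deviation.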

The design MSE of both strategies is computed and we get ${MSE}_{π ps} ({\hat{t}}_{greg}) = 2.29 \cdot 10^{25}$ and ${MSE}_{STSI} ({\hat{t}}_{greg}) = 1.36 \cdot 10^{25} .$ The strategy suggested by (4.1) was indeed the best choice.

Case 2. Using a sample from 2010, prior values of $δ_{12} =$ 1.9, $δ_{2} =$ 2 and $R_{x, y} =$ 0.7 are proposed. As the uncertainty here is smaller than that in Case 1, we set a smaller variance,

$\begin{bmatrix} β_{12} \\ β_{2} \end{bmatrix} \sim N \left( \begin{bmatrix} 1.9 \\ 2.0 \end{bmatrix}, \begin{bmatrix} 0.2471^{2} & 0 \\ 0 & 0.2471^{2} \end{bmatrix} \right) .$

The variance was chosen in such a way that 99% of the mass lies in the circle of radius 0.75. Evaluation of (4.1) yields $R (π ps) = 7.08 \cdot 10^{22} β_{11}^{2}$ and $R ( STSI ) = 4.06 \cdot 10^{18} β_{11}^{2},$ suggesting that a stratified design should be used.

The design MSE of both strategies is computed and we get ${MSE}_{π ps} ({\hat{t}}_{greg}) = 1.85 \cdot 10^{28}$ and ${MSE}_{STSI} ({\hat{t}}_{greg}) = 1.91 \cdot 10^{25} .$ Note that the use of (4.1) prevented us from using $π ps,$ whose MSE is almost one thousand times larger than that under stratified sampling!

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2021-06-24
