A method to find an efficient and robust sampling strategy under model uncertainty
Section 2. Optimal strategy under the superpopulation model

Let $U$ be a finite population of size $N$ with elements labeled $\{1, 2, \dots, k, \dots, N\}.$ Let $x_{k} = (x_{1 k}, x_{2 k}, \dots, x_{J k})$ be a known vector of values of $J$ auxiliary variables and $y_{k}$ the unknown value of a study variable associated with unit $k \in U.$ We are interested in estimating the total of $y,$ $t_{y} = \sum_{U} y_{k}.$

Let $Ω$ be the power set of $U.$ A sample is any subset $s \in Ω$ and a sampling design is a probability distribution on $Ω,$ denoted by $P (S = s)$ or simply $p (s).$ Let $π_{k} = \sum_{s ∋ k} p (s)$ be the inclusion probability of $k$ and $π_{k l} = \sum_{s \supset \{k, l\}} p (s)$ the joint inclusion probability of $k$ and $l.$ A probability sampling design is a sampling design such that $π_{k} > 0$ for all $k \in U.$
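As an illustration of these definitions, the following minimal sketch computes $π_{k}$ and $π_{k l}$ by summing $p (s)$ over the samples that contain the relevant units. The function name `inclusion_probabilities`, the dictionary representation of the design, and the toy design (simple random sampling of 2 units out of 4) are assumptions made for the example and do not come from the paper.

```python
from itertools import combinations

def inclusion_probabilities(design, N):
    """First- and second-order inclusion probabilities of a sampling design.

    design: dict mapping each sample (a frozenset of unit labels 0..N-1)
            to its probability p(s).
    Returns (pi, pi2) where pi[k] = pi_k and pi2[k][l] = pi_kl.
    """
    pi = [0.0] * N
    pi2 = [[0.0] * N for _ in range(N)]
    for s, p in design.items():
        for k in s:
            pi[k] += p              # s contains k
            for l in s:
                pi2[k][l] += p      # s contains both k and l (diagonal: pi_kk = pi_k)
    return pi, pi2

# Hypothetical design: simple random sampling of n = 2 units from N = 4.
N, n = 4, 2
samples = [frozenset(c) for c in combinations(range(N), n)]
srs = {s: 1.0 / len(samples) for s in samples}

pi, pi2 = inclusion_probabilities(srs, N)
```

For this toy design every $π_{k}$ is $n / N$ and the $π_{k}$ sum to $n,$ as expected for a fixed-size design.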

An estimator is a real-valued function of the sample, ${\hat{t}}_{y} = {\hat{t}}_{y} (S).$ By a strategy we mean the pair formed by a sampling design and an estimator, $(p (\cdot), {\hat{t}}_{y}).$

We consider only probability sampling designs with fixed sample size. As a convenient stepping stone we begin by considering unbiased linear estimators of the form

${\hat{t}}_{y} = \left(\sum_{U} z_{k} - \sum_{s} \frac{z_{k}}{π_{k}}\right) + \sum_{s} \frac{y_{k}}{π_{k}} = \sum_{U} z_{k} + \sum_{s} \frac{e_{k}}{π_{k}} \qquad (2.1)$

with $z_{k}$ arbitrary known constants and $e_{k} = y_{k} - z_{k} .$ This estimator is called the difference estimator. The estimator defined in this way is said to be calibrated on $z$ as it satisfies ${\hat{t}}_{z} = \sum_{U} z_{k} .$ Note that if $z_{k} = 0$ for all $k \in U$ the estimator reduces to ${\hat{t}}_{y} = \sum_{s} y_{k} / π_{k},$ that is, the Horvitz-Thompson estimator (Horvitz and Thompson, 1952). In later sections we focus on the generalized regression estimator (GREG).
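The difference estimator (2.1) and its two properties mentioned above (calibration on $z$ and reduction to the Horvitz-Thompson estimator when $z_{k} = 0$) can be sketched in a few lines. The function name and the toy values of $y,$ $z$ and $π$ below are hypothetical and serve only as an illustration.

```python
def difference_estimator(sample, y, z, pi):
    """Difference estimator (2.1): sum_U z_k + sum_s (y_k - z_k) / pi_k.

    sample: indices of the sampled units.
    y:      study-variable values (only the sampled entries are read).
    z, pi:  known constants and inclusion probabilities for the whole population.
    """
    return sum(z) + sum((y[k] - z[k]) / pi[k] for k in sample)

# Hypothetical toy population of N = 4 units and sample s = {0, 2}.
y = [10.0, 12.0, 20.0, 31.0]
z = [9.0, 13.0, 21.0, 30.0]
pi = [0.5, 0.5, 0.5, 0.5]
s = [0, 2]

t_diff = difference_estimator(s, y, z, pi)
t_ht = difference_estimator(s, y, [0.0] * 4, pi)  # z = 0: Horvitz-Thompson estimator
t_z = difference_estimator(s, z, z, pi)           # estimating t_z reproduces sum_U z_k
```

Because the residuals $e_{k}$ are all zero when the study variable is $z$ itself, `t_z` equals $\sum_{U} z_{k}$ for any sample, which is the calibration property.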

The design Mean Squared Error (MSE) of the difference estimator is

${MSE}_{p} ({\hat{t}}_{y}) = {MSE}_{p} \left(\sum_{s} \frac{e_{k}}{π_{k}}\right) = \sum_{U} \sum_{U} (π_{k l} - π_{k} π_{l}) \frac{e_{k}}{π_{k}} \frac{e_{l}}{π_{l}}. \qquad (2.2)$
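As a sanity check on (2.2), the sketch below compares the double-sum formula with a brute-force evaluation of $E_{p} ({({\hat{t}}_{y} - t_{y})}^{2})$ over all samples of a small design. The design (simple random sampling of 2 units out of 4), the residuals `e`, and the function names are assumptions chosen for the example; with $π_{k k} = π_{k}$ on the diagonal, the two computations should agree.

```python
from itertools import combinations

def mse_formula(e, pi, pi2):
    """Double sum in (2.2); pi2[k][l] holds pi_kl, with pi2[k][k] = pi_k."""
    N = len(e)
    return sum((pi2[k][l] - pi[k] * pi[l]) * (e[k] / pi[k]) * (e[l] / pi[l])
               for k in range(N) for l in range(N))

def mse_brute_force(e, pi, design):
    """E_p[(sum_s e_k / pi_k - sum_U e_k)^2], enumerating every sample."""
    t_e = sum(e)
    return sum(p * (sum(e[k] / pi[k] for k in s) - t_e) ** 2
               for s, p in design.items())

# Hypothetical setting: simple random sampling, n = 2 out of N = 4.
N, n = 4, 2
samples = list(combinations(range(N), n))
design = {s: 1.0 / len(samples) for s in samples}
pi = [n / N] * N
pi_kl = n * (n - 1) / (N * (N - 1))
pi2 = [[pi[k] if k == l else pi_kl for l in range(N)] for k in range(N)]
e = [1.0, -1.0, -1.0, 1.0]   # hypothetical residuals e_k = y_k - z_k

v_formula = mse_formula(e, pi, pi2)
v_brute = mse_brute_force(e, pi, design)
```

Enumeration is only feasible for very small populations; the closed form (2.2) is what makes the design MSE tractable in general.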

As mentioned in the introduction, due to the non-existence of an optimal strategy under the design-based approach, often a superpopulation model, $ξ_{0},$ is proposed and we search for an optimal strategy with respect to the anticipated mean-squared error,

${MSE}_{ξ_{0} p} ({\hat{t}}_{y}) = E_{ξ_{0}} {MSE}_{p} ({\hat{t}}_{y}) = E_{ξ_{0}} E_{p} \left({({\hat{t}}_{y} - t_{y})}^{2}\right). \qquad (2.3)$

We may assume that the $y$-values are realizations of the following model, denoted $ξ_{0},$

$Y_{k} = f (x_{k} | δ_{1}) + ε_{k}$

with

$E_{ξ_{0}} (ε_{k}) = 0, \quad V_{ξ_{0}} (ε_{k}) = σ_{0}^{2} g {(x_{k} | δ_{2})}^{2} \quad \text{and} \quad E_{ξ_{0}} (ε_{k} ε_{l}) = 0 \;\; \forall k \neq l \qquad (2.4)$

where $δ = (δ_{1}, δ_{2})$ is a vector of parameters, $f : \mathbb{R}^{J} \to \mathbb{R}$ and $g : \mathbb{R}^{J} \to \mathbb{R}^{+}.$ The random sample $S$ and the errors $ε_{k}$ are assumed to be independent. Following Rosén (2000), the terms $f (x_{k} | δ_{1})$ and $g (x_{k} | δ_{2}) > 0$ will be called trend and spread, respectively. The term trend should not in general be understood in a temporal sense; rather, it refers to how the $y$-values develop with $x.$
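To make the roles of trend and spread concrete, the following sketch generates $y$-values from a model of the form (2.4) with a single auxiliary variable. The specific choices $f (x | δ_{1}) = δ_{1 0} + δ_{1 1} x,$ $g (x | δ_{2}) = x^{δ_{2}},$ normally distributed errors, and all numerical values are assumptions made here for illustration; the model itself only fixes the first two moments of $ε_{k}.$

```python
import random

def simulate_y(x, delta1, delta2, sigma0, rng):
    """Draw y_k = f(x_k | delta1) + eps_k with sd(eps_k) = sigma0 * g(x_k | delta2).

    Assumed (hypothetical) forms, for one auxiliary variable x > 0:
      trend  f(x | delta1) = delta1[0] + delta1[1] * x
      spread g(x | delta2) = x ** delta2
    Errors are drawn independently from a normal distribution (an assumption).
    """
    trend = [delta1[0] + delta1[1] * xk for xk in x]
    spread = [xk ** delta2 for xk in x]
    y = [t + sigma0 * g * rng.gauss(0.0, 1.0) for t, g in zip(trend, spread)]
    return y, trend, spread

rng = random.Random(12345)
x = [rng.uniform(1.0, 10.0) for _ in range(500)]
y, trend, spread = simulate_y(x, delta1=(2.0, 3.0), delta2=0.75, sigma0=1.5, rng=rng)

# Standardized residuals should have mean near 0 and variance near 1.
resid = [(yk - t) / (1.5 * g) for yk, t, g in zip(y, trend, spread)]
```

Larger values of $δ_{2}$ make the scatter around the trend grow faster with $x,$ which is the heteroscedasticity that the spread function is meant to capture.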

Note that under $ξ_{0},$ $e_{k}$ in the difference estimator (2.1) is a random variable that represents the deviation of the study variable from $z_{k},$ i.e., $e_{k} = f (x_{k} | δ_{1}) + ε_{k} - z_{k}.$ Therefore $E_{ξ_{0}} (e_{k}) = f (x_{k} | δ_{1}) - z_{k}$ and $E_{ξ_{0}} (e_{k}^{2}) = {(f (x_{k} | δ_{1}) - z_{k})}^{2} + σ_{0}^{2} g {(x_{k} | δ_{2})}^{2}.$ With some algebra, it can be seen from (2.2) and (2.3) that the anticipated MSE of the difference estimator becomes

${MSE}_{ξ_{0} p} ({\hat{t}}_{y}) = {MSE}_{p} \left(\sum_{s} \frac{f (x_{k} | δ_{1}) - z_{k}}{π_{k}}\right) + σ_{0}^{2} \sum_{U} \left(\frac{1}{π_{k}} - 1\right) g {(x_{k} | δ_{2})}^{2}. \qquad (2.5)$

Nedyalkova and Tillé (2008) derive the anticipated MSE in a more general case.

Tillé and Wilhelm (2017) give the anticipated MSE of the Horvitz-Thompson estimator. The second term in (2.5) is the Godambe-Joshi lower bound (e.g., Särndal, Swensson and Wretman, 1992, page 453).
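The algebra leading to (2.5) can be sketched as follows. The shorthand $a_{k} = f (x_{k} | δ_{1}) - z_{k},$ $g_{k} = g (x_{k} | δ_{2})$ and $Δ_{k l} = π_{k l} - π_{k} π_{l}$ (with $π_{k k} = π_{k}$) is introduced here only for this sketch. Since $e_{k} = a_{k} + ε_{k}$ and the errors are uncorrelated with mean zero,

$E_{ξ_{0}} (e_{k} e_{l}) = a_{k} a_{l} + σ_{0}^{2} g_{k}^{2} \, 1 (k = l).$

Because the sample and the errors are independent, the inclusion probabilities do not depend on the errors, so the model expectation can be taken inside the double sum in (2.2):

$E_{ξ_{0}} {MSE}_{p} ({\hat{t}}_{y}) = \sum_{U} \sum_{U} Δ_{k l} \frac{a_{k}}{π_{k}} \frac{a_{l}}{π_{l}} + σ_{0}^{2} \sum_{U} Δ_{k k} \frac{g_{k}^{2}}{π_{k}^{2}}.$

The first term has the form of (2.2) with $a_{k}$ in place of $e_{k},$ that is, it is the design MSE of $\sum_{s} a_{k} / π_{k}.$ For the second term,

$\frac{Δ_{k k}}{π_{k}^{2}} = \frac{π_{k} - π_{k}^{2}}{π_{k}^{2}} = \frac{1}{π_{k}} - 1,$

which gives the two terms of (2.5).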

The anticipated MSE in (2.5) is the sum of two nonnegative terms. It is easy to see that if the estimator is calibrated on $z_{k} = f (x_{k} | δ_{1})$ (condition 1), the first term vanishes and the anticipated MSE equals the Godambe-Joshi lower bound

${MSE}_{ξ_{0} p} ({\hat{t}}_{y}) = σ_{0}^{2} \sum_{U} \left(\frac{1}{π_{k}} - 1\right) g {(x_{k} | δ_{2})}^{2}. \qquad (2.6)$

Furthermore, after imposing the fixed sample size restriction $\sum_{U} π_{k} = n,$ if the design is such that $π_{k} \propto g (x_{k} | δ_{2})$ (condition 2), denoted $π\text{ps} (δ_{2}),$ the second term is minimized and we obtain

${MSE}_{ξ_{0} p}^{opt} ({\hat{t}}_{y}) = σ_{0}^{2} \left(\frac{1}{n} {\left(\sum_{U} g (x_{k} | δ_{2})\right)}^{2} - \sum_{U} g {(x_{k} | δ_{2})}^{2}\right).$
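The minimized value above can be checked numerically. The sketch below evaluates the Godambe-Joshi bound for inclusion probabilities proportional to the spread and compares it with the closed form and with equal inclusion probabilities. The spread values, the sample size and the function names are hypothetical, and the sketch assumes that no resulting $π_{k}$ exceeds 1 (otherwise such units would need separate handling, which is not covered here).

```python
def gj_bound(pi, g, sigma0=1.0):
    """Godambe-Joshi bound: sigma0^2 * sum_U (1/pi_k - 1) * g_k^2."""
    return sigma0 ** 2 * sum((1.0 / p - 1.0) * gk ** 2 for p, gk in zip(pi, g))

def pips(g, n):
    """Inclusion probabilities pi_k = n * g_k / sum_U g_k.

    Assumes every resulting pi_k is at most 1; raises otherwise.
    """
    total = sum(g)
    pi = [n * gk / total for gk in g]
    if max(pi) > 1.0:
        raise ValueError("some pi_k exceed 1; such units need separate handling")
    return pi

g = [1.0, 2.0, 3.0, 4.0]          # hypothetical spread values g(x_k | delta_2)
n = 2
pi_opt = pips(g, n)               # proportional to the spread
pi_srs = [n / len(g)] * len(g)    # equal probabilities, for comparison

bound_opt = gj_bound(pi_opt, g)
bound_srs = gj_bound(pi_srs, g)
closed_form = sum(g) ** 2 / n - sum(gk ** 2 for gk in g)
```

In this toy case the bound under proportional probabilities matches the closed form and is smaller than the bound under equal probabilities; the gap widens as the spread values become more unequal.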

Conditions 1 and 2 suggest the specific roles of the design and the estimator in the sampling strategy. The estimator should “explain” the trend in the calibration sense of condition 1. The design should “explain” the spread in the sense of condition 2. A strategy that satisfies conditions 1 and 2 simultaneously will be called optimal. In the same sense, any estimator satisfying condition 1 and any design satisfying condition 2 will be called optimal. As this strategy plays an important role in this paper, we will denote it by $π\text{ps} (δ_{2}) \text{-diff} (δ_{1}).$

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).


Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2021-06-24
