5 A third alternative, "Use them both together"
Ken Brewer
Eventually, a
third position was also offered, the one held by the present author, namely
that, since there were merits in both the design-based (or randomization-based)
and the model-based (or prediction-based) approaches, and since it was
possible to combine them, the two should be used together. I had actually
foreshadowed this possibility in Brewer (1963), a paper that provoked little
interest at the time, but was later spotted and accorded recognition by J.N.K.
Rao, at least to the extent that he invited me to visit him in Ottawa for six
weeks in 1974.
To combine these
two approaches was relatively simple. In each of them there was a variable $y_i$ which was of central interest and
a related or auxiliary variable $x_i$ about which something additional
was known that could be of assistance in estimating the value of that $y_i$ variable. That "something
additional" was typically the known population total of all the $x_i$ values, denoted by $X$. Consequently the relationship of central interest was
that which linked the crucial parameter $\beta$ in equation (4.1) to its cosmetic
estimator, namely
[equation]
where $\pi_i$ is the probability that unit $i$ is selected in the sample, or in the notation
used by Särndal (2011),
[equation]
where his $d_k$ is identical to my $\pi_i^{-1}$. The resulting estimator of the
total is
[equation]
Särndal (2011)
also shows that these $y$ and $x$ values can be related to each
other in several different ways, but also shows that there is a common theme
that runs through all of those ways. That common theme is that $y$ increases linearly as $x$ increases, and that the extent of
that linearity is measured by the parameter $\beta$ in equation (4.1). Importantly,
however, when the cosmetic estimator of $\beta$ replaces the model-based estimator of $\beta$ in Royall's prediction estimator,
the estimator can be shown to be nearly unbiased under the design regardless of
the validity of the assumed model.
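The display equations for the cosmetic estimator are not reproduced in this text, so the following Python sketch rests on an assumption: it uses one single-auxiliary form of the cosmetic slope, the ratio of the sums of $(\pi_i^{-1} - 1) y_i$ and $(\pi_i^{-1} - 1) x_i$ over the sample, which may differ in detail from the equations the section refers to. All data values and function names are hypothetical. The point it illustrates is that, with this slope, the prediction form of the total and the design-based regression (GREG) form give the same number.

```python
def cosmetic_total(y, x, pi, x_total):
    """Prediction-form total with an ASSUMED cosmetic slope estimate.

    y, x    : sample values of the variable of interest and the auxiliary
    pi      : inclusion probabilities of the sampled units
    x_total : known population total of the auxiliary variable
    """
    # Each sampled unit is weighted by the number of non-sample units it
    # stands for (1/pi - 1). At least one pi must be below 1, otherwise
    # the denominator is zero.
    w = [1.0 / p - 1.0 for p in pi]
    beta = sum(wi * yi for wi, yi in zip(w, y)) / sum(wi * xi for wi, xi in zip(w, x))
    # Observed part plus predicted part for the units outside the sample.
    return sum(y) + beta * (x_total - sum(x)), beta


def greg_total(y, x, pi, x_total, beta):
    """Design-based regression form: HT total plus a slope correction."""
    ht_y = sum(yi / p for yi, p in zip(y, pi))
    ht_x = sum(xi / p for xi, p in zip(x, pi))
    return ht_y + beta * (x_total - ht_x)


# Hypothetical sample of three units.
y = [12.0, 30.0, 7.0]
x = [10.0, 28.0, 6.0]
pi = [0.1, 0.5, 0.25]
x_total = 400.0

total, beta = cosmetic_total(y, x, pi, x_total)
greg = greg_total(y, x, pi, x_total, beta)
print(total, greg)
```

Under this assumed form the two totals coincide algebraically, which is one way of reading the claim that the estimator can be interpreted under both the prediction and the randomization approaches.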
Equation (5.2) can
also be found explicitly on page 569 of Brewer (2011), immediately following
its more general formula in matrix notation, namely
[equation]
When the question
arises as to how many explanatory variables should be used in the relevant
model, Särndal (2011) makes an apparently disparaging distinction between
"explanatory rich" and "explanatory poor" countries. He certainly treats those
"explanatory poor" countries as being at a substantial disadvantage as a result
of having relatively few "explanators".
There is at least
one "explanatory rich" country (Australia) that appears to have made a
deliberate decision to ignore whatever advantages might be available to those
that are "explanatory rich". The current Australian procedure (the one used
primarily to produce seasonally adjusted series) is to use only a single
auxiliary variable, namely the latest available Census total, as the single
"explanator".
Earlier, Brewer
(1999a) had also presented a case that it might be preferable to use a cosmetic
regression estimator to compensate for any lack of balance, rather than go to
the trouble of selecting balanced samples. However, those who prefer to use
balanced sampling directly can now select randomly from among many balanced or
nearly balanced samples using the "cube method" (Deville and Tillé 2004). That
paper also contains several references to earlier methods of selecting balanced
samples, but regardless of how the relevant balanced sample is arrived at, the
ways in which it needs to be used are identical.
In Brewer and
Gregoire (2009) all three of the relevant approaches to estimation
(randomization alone, prediction alone, and the two together) are examined. At
this point, it is convenient to quote from yet another paper of mine (Brewer
2005, pages 390-391) which sets out the reasons why I was, and still am,
concerned to use both methods simultaneously, and how readily it can be done.
"Each approach has
its merits, and there are advantages in using both together. Consider how each
of these inferences works.
First,
design-based inference. Consider the general case where the inclusion
probabilities are known but may differ from unit
to unit. In that case we can imagine the sampling statistician constructing a
model of the population by looking at each of the sample units in turn and
saying, 'Oh yes, you (the first unit) were
included with one chance in 10, so my model of the population includes you and
nine other non-sample units with the same value as you. But you (the second unit), you
were included with only one chance in two, so my model includes you and only
one other unit like you.'"
The consequence of
using this procedure here was therefore that the model of the population in the
sampler's mind would consist of two real sample units (one from each sample
stratum) plus ten imaginary units (nine from the stratum with a sample
fraction of one in ten, plus one from the stratum with a sample fraction of one
in two) and, finally, all the units from the completely enumerated stratum.
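The counting in this passage can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the size of the completely enumerated stratum are hypothetical, since the text does not state how many units that stratum contains.

```python
from fractions import Fraction


def imaginary_units(pi):
    """Non-sample units that one sampled unit with inclusion probability pi stands for."""
    return int(1 / pi - 1)


# Inclusion probabilities of the two sampled units described in the text.
sample_pis = [Fraction(1, 10), Fraction(1, 2)]
imaginary = [imaginary_units(p) for p in sample_pis]

# Hypothetical size of the completely enumerated stratum (not given in the text).
n_enumerated = 5

# Real sample units + imaginary units + completely enumerated units.
model_population_size = len(sample_pis) + sum(imaginary) + n_enumerated
print(imaginary, model_population_size)
```

Exact fractions are used so that the "one chance in 10" unit yields exactly nine imaginary companions, with no floating-point rounding.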
Brewer (2005, page
391) continues as follows: "So even design-based estimation can be thought of
as being based on a model, but on a model quite different from the prediction
models… that are favoured by the so-called model-based
school. More accurately that school should be described as prediction-based and the design-based
school should be described as randomization-based.
Each school uses a model, but one uses a prediction model and the other a
randomization model."
The
randomization-based approach described above is the one that was used for the
selection of two sample units (one from each sampled stratum) plus all the
units in the completely enumerated stratum. It also gave rise to the well-known
Horvitz-Thompson estimator, which may be written
$$\hat{Y}_{HT} = \sum_{i=1}^{N} \delta_i y_i \pi_i^{-1},$$
where $\delta_i$ is an inclusion indicator taking
the value "one" if the $i$th unit is either in the sample or
in the completely enumerated sector, and the value "zero" otherwise. In this
particular case it is defined over both the two sampled units and also all the
units in the completely enumerated sector. [This last sentence corrects the
error mentioned above.]
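As a rough illustration of the Horvitz-Thompson estimator in this setting, the Python sketch below sums each included unit's value divided by its inclusion probability, with completely enumerated units entering at probability one. The population values, stratum sizes and function name are all hypothetical and chosen only to mirror the one-in-ten and one-in-two example.

```python
def horvitz_thompson(y, pi, delta):
    """Sum of delta_i * y_i / pi_i over all population units."""
    return sum(d * yi / p for yi, p, d in zip(y, pi, delta))


# Each tuple is (y value, inclusion probability, inclusion indicator).
# Stratum A: 10 units, one sampled with probability 1/10.
stratum_a = [(4.0, 0.1, 1)] + [
    (v, 0.1, 0) for v in (3.0, 5.0, 4.5, 3.5, 4.0, 6.0, 2.5, 4.0, 5.5)
]
# Stratum B: 2 units, one sampled with probability 1/2.
stratum_b = [(15.0, 0.5, 1), (17.0, 0.5, 0)]
# Completely enumerated stratum: every unit included with probability 1.
enumerated = [(100.0, 1.0, 1), (120.0, 1.0, 1), (90.0, 1.0, 1)]

population = stratum_a + stratum_b + enumerated
y, pi, delta = zip(*population)

ht_estimate = horvitz_thompson(y, pi, delta)
print(ht_estimate)
```

The sampled unit from stratum A contributes ten times its value, the sampled unit from stratum B contributes twice its value, and the completely enumerated units contribute exactly their own values.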
Statisticians of
the prediction-based school ridicule the use of randomization-based inference
because the inclusion probabilities are chosen arbitrarily by the sample
designer, and are therefore unable (they say) to tell us anything meaningful
about the population! They prefer instead to use the Best Linear Unbiased
Estimator (BLUE) of the regression parameter as a step towards arriving at the
Best Linear Unbiased Predictor (BLUP) of $Y$. It is a predictor, because $Y$ is a random variable under the
model, not a parameter.
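To make the BLUE-then-BLUP step concrete, here is a minimal Python sketch under an assumed working model that is not spelled out in this section: a regression through the origin, $y_i = \beta x_i + \varepsilon_i$, with error variance proportional to $x_i$. Under that particular model the BLUE of the slope is the ratio of the sample sums, and the BLUP of the total is the familiar ratio estimator. The data and function names are hypothetical.

```python
def blue_slope(y, x):
    """BLUE of beta under the assumed model (variance proportional to x)."""
    return sum(y) / sum(x)


def blup_total(y, x, x_total):
    """Observed sample sum plus predicted sum for the non-sample units."""
    beta = blue_slope(y, x)
    x_nonsample = x_total - sum(x)  # auxiliary total outside the sample
    return sum(y) + beta * x_nonsample


# Hypothetical sample and known auxiliary population total.
y = [12.0, 30.0, 7.0]
x = [10.0, 28.0, 6.0]
x_total = 400.0

blup = blup_total(y, x, x_total)
ratio_form = x_total * sum(y) / sum(x)  # algebraically the same quantity
print(blup, ratio_form)
```

Note that the inclusion probabilities do not appear anywhere in this calculation, which is the feature of prediction-based inference that the paragraph above contrasts with the randomization-based approach.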
Which is then the
better estimator of $Y$, the HT or the BLUP? The BLUP is
the better if the prediction model holds exactly, and is much the better if
both the sample and the population are small. However, there will always be some
sample size beyond which the HT is the more efficient estimator unless the
model holds exactly.