Estimation of response propensities and indicators of representative response using population-level information
Section 2. Population-based response propensities
2.1 General notation
We suppose that a sample survey is undertaken, where a sample $s$ is selected from a finite population $U$. The sizes of $s$ and $U$ are denoted by $n$ and $N$, respectively. The units in $U$ are labelled $i = 1, \ldots, N$. The sample is assumed to be drawn by a probability sampling design $p(\cdot)$, where the sample $s$ is selected with probability $p(s)$. The first-order inclusion probability of unit $i$ is denoted $\pi_i$, and $d_i = \pi_i^{-1}$ is the design weight. The evaluation study is based on simple random sampling without replacement. Although large-scale national surveys may use more complex two-stage designs, many are generally planned so that all survey units have an equal inclusion probability. We also provide theoretical expressions under more general complex survey designs.
We suppose that the survey is subject to unit nonresponse. The set of responding units is denoted by $r$, so $r \subseteq s \subseteq U$. We denote summation over the respondents, sample and population by $\sum_r$, $\sum_s$ and $\sum_U$, respectively. Let $R_i$ be the response indicator variable so that $R_i = 1$ if unit $i$ responds and $R_i = 0$ otherwise. Hence, $r = \{i \in s : R_i = 1\}$. We shall suppose that the typical target of inference is a population mean $\bar{Y} = N^{-1} \sum_U y_i$ of a survey variable, taking value $y_i$ for unit $i$.
We suppose that the data available for estimation purposes consist first of the values $\{y_i : i \in r\}$ of the survey variable, observed only for respondents. Secondly, we suppose that information is available on the values of a vector of auxiliary variables $x_i = (x_{1,i}, \ldots, x_{K,i})^T$. We shall usually suppose each $x_{k,i}$ is a binary indicator variable, where $x_i$ represents one or more categorical variables, since this will be the case in the applications we consider, but our presentation allows for general $x_i$ values. We assume that values of $x_i$ are observed for all respondents so that $\{(y_i, x_i) : i \in r\}$ is observed.
We distinguish two settings: one in which $x_i$ is known for all sample units, i.e., for both respondents and non-respondents, and one in which $x_i$ is known only at the aggregate level: the population total $\sum_U x_i$ and/or the population cross-products $\sum_U x_i x_i^T$. We refer to the two types of information as sample-based auxiliary information and aggregate population-based auxiliary information. The first setting is relevant if the variables making up $x_i$ are available on a register. However, as outlined in the introduction, in many countries and surveys the availability of auxiliary information on non-respondents may be limited and the second setting using population-based auxiliary information may be more useful.
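To make the two kinds of aggregate information concrete, the following minimal Python sketch computes the population total and the population cross-products from a small indicator matrix. The six-unit population and its variables (an intercept, a gender indicator and an age indicator) are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical population of N = 6 units. Columns are binary indicators:
# intercept, female, aged 35 or over (illustrative variables only).
X_pop = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
], dtype=float)

# Population total sum_U x_i: the marginal counts of each indicator.
pop_total = X_pop.sum(axis=0)

# Population cross-products sum_U x_i x_i^T: entry (k, l) counts the units
# for which indicators k and l are both equal to one.
pop_cross = X_pop.T @ X_pop

print(pop_total)
print(pop_cross)
```

For binary indicators the diagonal of the cross-product matrix repeats the marginal counts, while the off-diagonal entries carry the joint counts that marginal information alone does not provide.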
2.2 Definition of response propensities
The theory of propensity scores was introduced by Rosenbaum and Rubin (1983) and discussed in the context of survey nonresponse by Little (1986; 1988). Response propensities are defined as the conditional expectation of the response indicator variable $R_i$ given the values of specified variables and survey conditions:
$$\rho_X(x_i) = E_m(R_i \mid x_i),$$
where the vector of auxiliary variables $x_i$ is defined as in Section 2.1. For simplicity, we shall write $\rho_i = \rho_X(x_i)$ and hence denote the response propensity just by $\rho_i$. Here $E_m$ denotes expectation with respect to the model underlying the response mechanism. A detailed discussion of response propensities and their properties is presented in Shlomo et al. (2012). They argue that it is desirable to select auxiliary variables constituting $x_i$ in such a way that the missing at random assumption, denoted MAR (Little and Rubin, 2002), holds as closely as possible.
2.3 Estimation of response propensities using population-level information
In the case of sample-based auxiliary information, it is possible to estimate response propensities for all sampled units by means of regression models
$$g(\rho_i) = x_i^T \beta,$$
where $g(\cdot)$ is a link function, $\rho_i$ is the dependent variable, and $x_i$ is a vector of explanatory variables. Generally, the response propensities are modelled by generalized linear models. Shlomo et al. (2012) use a logistic link function.
In the population-based setting, it is convenient to consider the identity link function. The identity link function is a good approximation to the more widely used logistic link function when response rates are mid-range, between 30% and 70%, which is typical of the response rates obtained in national and other surveys. We demonstrate this in the evaluation study presented in Section 4, where three ranges of response rates are investigated: low, medium and high. The identity link function also forms the basis for other representativeness indicators in the literature, such as the imbalance and distance indicators proposed by Särndal (2011), some of which are similar to the g-weights calculated for generalized regression (GREG) estimators.
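The closeness of the two link functions for mid-range propensities can be illustrated numerically. The sketch below is not the evaluation study of Section 4; it is a simple illustration under the assumption that the linear approximation is taken as the tangent to the logistic curve at a propensity of 0.5. The function names are hypothetical.

```python
import numpy as np

def logistic(eta):
    """Inverse logistic link: maps a linear predictor to a propensity."""
    return 1.0 / (1.0 + np.exp(-eta))

def tangent_line(eta):
    """First-order Taylor expansion of the logistic curve at eta = 0,
    i.e., at a propensity of 0.5 (an assumption made for illustration)."""
    return 0.5 + eta / 4.0

def max_abs_gap(p_low, p_high, num=1001):
    """Largest absolute gap between the logistic curve and its tangent
    over linear predictors whose propensities span [p_low, p_high]."""
    eta_low = np.log(p_low / (1.0 - p_low))
    eta_high = np.log(p_high / (1.0 - p_high))
    eta = np.linspace(eta_low, eta_high, num)
    return float(np.max(np.abs(logistic(eta) - tangent_line(eta))))

mid_gap = max_abs_gap(0.30, 0.70)    # mid-range propensities
wide_gap = max_abs_gap(0.05, 0.95)   # propensities close to 0 or 1

print(mid_gap, wide_gap)
```

Over the 30% to 70% range the gap stays below 0.02, whereas it exceeds 0.2 once propensities approach 5% or 95%, which is in line with restricting the identity link to mid-range response rates.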
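Under the identity link function we assume that the true response propensities satisfy the “linear probability model”
$$\rho_i = x_i^T \beta, \quad i \in U. \qquad (2.1)$$
The linear probability model in (2.1) can be estimated by weighted least squares, where $d_i$ is the design weight. The implied estimator of $\rho_i$ is given by
$$\hat{\rho}_i^{OLS} = x_i^T \Big( \sum_s d_j x_j x_j^T \Big)^{-1} \sum_r d_j x_j. \qquad (2.2)$$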
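As a minimal sketch of the weighted least squares estimator in (2.2), the Python code below solves the design-weighted normal equations for a toy sample. The function name `wls_propensities` and the data are hypothetical; the sketch assumes that $x_i$ is observed for the whole sample, as in the sample-based setting.

```python
import numpy as np

def wls_propensities(X, R, d):
    """Design-weighted least squares fit of the linear probability model.

    X : (n, K) auxiliary variables for all sampled units
    R : (n,)   response indicators (1 = respondent, 0 = non-respondent)
    d : (n,)   design weights
    """
    A = (X * d[:, None]).T @ X                # sum_s d_j x_j x_j^T
    b = (X * (d * R)[:, None]).sum(axis=0)    # sum_s d_j x_j R_j = sum_r d_j x_j
    beta = np.linalg.solve(A, b)
    return X @ beta                           # fitted propensities x_i^T beta

# Toy sample: intercept plus one binary indicator, equal design weights.
X = np.array([[1, 0]] * 4 + [[1, 1]] * 4, dtype=float)
R = np.array([1, 1, 0, 0, 1, 1, 1, 0], dtype=float)
d = np.full(8, 10.0)

rho_hat = wls_propensities(X, R, d)
print(rho_hat)
```

Because the toy model is saturated (one indicator plus an intercept), the fitted propensities coincide with the weighted response rates in the two groups, 0.5 and 0.75.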
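In the case of population-based auxiliary information, we first note that $\sum_s d_i x_i x_i^T$ and $\sum_s d_i x_i$ are unbiased for $\sum_U x_i x_i^T$ and $\sum_U x_i$, respectively, and that in large samples we may expect that $\sum_s d_i x_i x_i^T \approx \sum_U x_i x_i^T$ and $\sum_s d_i x_i \approx \sum_U x_i$. It follows from (2.2) that, in the population-based setting, we may approximate $\hat{\rho}_i^{OLS}$ by
$$\tilde{\rho}_i^{T1} = x_i^T \Big( \sum_U x_j x_j^T \Big)^{-1} \sum_r d_j x_j, \qquad (2.3)$$
where $i \in r$. We note that $\tilde{\rho}_i^{T1}$ is computed only on the set of responding units.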
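The estimator in (2.3) can be sketched in Python as follows, replacing the sample cross-product matrix in (2.2) by its known population counterpart. The function name `type1_propensities` and the toy figures (a population of 80 units split evenly over one binary indicator, and five respondents each with design weight 10) are hypothetical.

```python
import numpy as np

def type1_propensities(X_resp, d_resp, pop_cross):
    """Propensities from respondent data and known population cross-products.

    X_resp    : (m, K) auxiliary variables of the respondents only
    d_resp    : (m,)   design weights of the respondents
    pop_cross : (K, K) population cross-products sum_U x_j x_j^T
    """
    b = (X_resp * d_resp[:, None]).sum(axis=0)   # sum_r d_j x_j
    beta = np.linalg.solve(pop_cross, b)
    return X_resp @ beta

# Hypothetical population: N = 80, of whom 40 have the indicator equal to one.
pop_cross = np.array([[80.0, 40.0],
                      [40.0, 40.0]])

# Respondents: two with indicator 0 and three with indicator 1.
X_resp = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [1, 1]], dtype=float)
d_resp = np.full(5, 10.0)

rho_t1 = type1_propensities(X_resp, d_resp, pop_cross)
print(rho_t1)
```

No information on non-respondents enters the calculation: only the respondents' $x_i$ and weights and the population cross-products are used.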
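The estimator in (2.3) requires knowledge of the population sums of squares and cross-products $\sum_U x_i x_i^T$ of the elements of $x_i$. However, the cross-products might be unknown. In that case, we can estimate $\sum_s d_i x_i x_i^T$ in (2.2) by rewriting
$$\sum_s d_i x_i x_i^T = \hat{N} \left( \hat{S}_{xx} + \bar{x}_s \bar{x}_s^T \right), \qquad (2.4)$$
where $\hat{N} = \sum_s d_i$ and $\bar{x}_s = \hat{N}^{-1} \sum_s d_i x_i$ may be replaced by $N$ and $\bar{x}_U = N^{-1} \sum_U x_i$, and the covariance matrix
$$\hat{S}_{xx} = \hat{N}^{-1} \sum_s d_i (x_i - \bar{x}_s)(x_i - \bar{x}_s)^T \qquad (2.5)$$
may be replaced by its estimate using the response set
$$\hat{S}_{xx}^{r} = \Big( \sum_r d_i \Big)^{-1} \sum_r d_i (x_i - \bar{x}_r)(x_i - \bar{x}_r)^T, \qquad (2.6)$$
with $\bar{x}_r = (\sum_r d_i)^{-1} \sum_r d_i x_i$. We can also estimate (2.6) using propensity weighting by the inverse of estimated response propensities to adjust for nonresponse bias in the variance of the response propensities relative to a set of auxiliary variables.
Combining (2.3), (2.4) and (2.6), we obtain the following estimator:
$$\tilde{\rho}_i^{T2} = x_i^T \Big\{ N \big( \hat{S}_{xx}^{r} + \bar{x}_U \bar{x}_U^T \big) \Big\}^{-1} \sum_r d_j x_j, \qquad (2.7)$$
where $i \in r$.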
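One possible reading of the estimator in (2.7) is sketched below in Python. It rests on an assumption that should be checked against the published formulas: the unknown cross-product matrix is approximated by $N(\hat{S}^{r} + \bar{x}_U \bar{x}_U^T)$, where $\bar{x}_U$ is the known population mean and $\hat{S}^{r}$ is the design-weighted covariance matrix of $x_i$ computed on the respondents. The function name `type2_propensities` and the toy data are hypothetical.

```python
import numpy as np

def type2_propensities(X_resp, d_resp, pop_total, N):
    """Propensities from respondent data and population marginal totals only.

    Assumption (illustrative): sum_U x x^T is approximated by
    N * (S_r + xbar_U xbar_U^T), with S_r estimated on the response set.
    """
    w = d_resp / d_resp.sum()                  # normalised respondent weights
    xbar_r = w @ X_resp                        # weighted respondent mean
    centred = X_resp - xbar_r
    S_r = (centred * w[:, None]).T @ centred   # weighted respondent covariance
    xbar_U = pop_total / N                     # known population mean
    cross_approx = N * (S_r + np.outer(xbar_U, xbar_U))
    b = (X_resp * d_resp[:, None]).sum(axis=0) # sum_r d_j x_j
    beta = np.linalg.solve(cross_approx, b)
    return X_resp @ beta

# Hypothetical marginal information: N = 80, of whom 40 have the indicator.
N = 80.0
pop_total = np.array([80.0, 40.0])

# Respondents: two with indicator 0 and three with indicator 1.
X_resp = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [1, 1]], dtype=float)
d_resp = np.full(5, 10.0)

rho_t2 = type2_propensities(X_resp, d_resp, pop_total, N)
print(rho_t2)
```

In this toy example the results are close to, but not identical with, the group response rates of 0.5 and 0.75, because the covariance matrix is estimated from respondents rather than taken from the population.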
We therefore distinguish between two types of aggregated population-based auxiliary information, as denoted by the indices $T1$ in (2.3) and $T2$ in (2.7):
TYPE 1 Full aggregate population-based auxiliary information: the population cross-products are available, i.e., $\sum_U x_i x_i^T$ or $N (S_{xx} + \bar{x}_U \bar{x}_U^T)$, where $\bar{x}_U = N^{-1} \sum_U x_i$ and $S_{xx} = N^{-1} \sum_U (x_i - \bar{x}_U)(x_i - \bar{x}_U)^T$.
TYPE 2 Marginal aggregate population-based auxiliary information: only the population marginal counts are available, i.e., $\sum_U x_i$.
The first type implies that all two-way tables are available, e.g., age by gender, age by marital status and gender by marital status. This information might be available to a national statistical institute that has access to population registers or detailed population demographics and wishes to use population-based information to monitor data collection because sample-based information is lacking on the sampling frames. The second type is more restrictive, as only marginal frequency counts are available, e.g., for age, gender and marital status, without any knowledge of the interactions. This information would be routinely available through the websites of national statistical institutes and can therefore be used by marketing and other data collection agencies to monitor their data collection.
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2019
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa