Statistical inference with non-probability survey samples
Section 3. Model-based prediction approach
Model-based prediction methods for finite population parameters require two critical ingredients: the amount of auxiliary information that is available at the estimation stage and the reliability of the assumed model for inference. In the absence of any auxiliary information, the common mean model $E_\xi(y_i) = \mu_0$ and $V_\xi(y_i) = \sigma^2$, $i = 1, \ldots, N$, may be viewed as reasonable, but the model-based prediction estimator $\hat{\mu}_y = \bar{y}_A$, although unbiased under the model since $E_\xi(\hat{\mu}_y - \mu_y) = 0$, is generally not an acceptable estimator of $\mu_y$. The variance $\sigma^2$ for the common mean model is typically large, and it leaves the estimator $\hat{\mu}_y$ with a prediction variance that is too large to be practically useful.
3.1 Semiparametric outcome regression models
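Without loss of generality, we assume that $\mathbf{x}$ contains 1 as its first component, corresponding to the intercept of a regression model. Under the setting described in Section 2, we consider the following semiparametric model for the finite population, denoted as $\xi$:

$$E_\xi(y_i \mid \mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta}), \qquad V_\xi(y_i \mid \mathbf{x}_i) = v(\mathbf{x}_i)\,\sigma^2, \qquad i = 1, \ldots, N, \tag{3.1}$$

where the mean function $m(\cdot, \cdot)$ and the variance function $v(\cdot)$ have known forms, and the $y_i$ are also assumed to be conditionally independent given the $\mathbf{x}_i$. Let $\boldsymbol{\beta}_0$ and $\sigma_0^2$ be the true values of the model parameters $\boldsymbol{\beta}$ and $\sigma^2$ under the assumed model. The first major implication of assumption A1 is that $E_\xi(y_i \mid \mathbf{x}_i, R_i = 1) = E_\xi(y_i \mid \mathbf{x}_i)$ and $V_\xi(y_i \mid \mathbf{x}_i, R_i = 1) = V_\xi(y_i \mid \mathbf{x}_i)$, where $R_i = 1$ indicates that unit $i$ is included in the non-probability sample. The model (3.1), which is assumed for the finite population, also holds for the units in the non-probability survey sample $\mathcal{S}_A$. The quasi maximum likelihood estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}_0$ is obtained using the dataset $\{(y_i, \mathbf{x}_i), i \in \mathcal{S}_A\}$ from the non-probability survey sample as the solution to the quasi score equations (McCullagh and Nelder, 1989) given by

$$S(\boldsymbol{\beta}) = \sum_{i \in \mathcal{S}_A} \frac{\partial m(\mathbf{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \{v(\mathbf{x}_i)\}^{-1} \{y_i - m(\mathbf{x}_i, \boldsymbol{\beta})\} = \mathbf{0}. \tag{3.2}$$

The semiparametric model (3.1) can be extended to replace $v(\mathbf{x}_i)$ by a general variance function $v(\mu_i)$, where $\mu_i = m(\mathbf{x}_i, \boldsymbol{\beta})$. The quasi maximum likelihood estimation theory covers linear or nonlinear regression models with the weighted least squares estimators, the logistic regression model and other generalized linear models. Let $m_i = m(\mathbf{x}_i, \boldsymbol{\beta}_0)$ and $\hat{m}_i = m(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$, $i = 1, \ldots, N$.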
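As a minimal sketch of the quasi score equations, assume a linear mean function $m(\mathbf{x}, \boldsymbol{\beta}) = \mathbf{x}^{\top}\boldsymbol{\beta}$ and a known variance function $v(\mathbf{x})$, in which case the equations have the closed-form weighted least squares solution. The simulated data, the choice $v(x) = x$, and the function names `quasi_score` and `fit_wls` are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def quasi_score(beta, X, y, v):
    """Quasi score for the linear mean function m(x, beta) = x'beta.

    Here dm/dbeta = x, so S(beta) = sum_i x_i * (y_i - x_i'beta) / v_i.
    """
    resid = y - X @ beta
    return X.T @ (resid / v)


def fit_wls(X, y, v):
    """Solve S(beta) = 0 in closed form (weighted least squares)."""
    w = 1.0 / v
    XtWX = X.T @ (X * w[:, None])
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)


# Hypothetical non-probability sample S_A (simulated for illustration only).
rng = np.random.default_rng(0)
n_A = 200
x = rng.uniform(1.0, 3.0, size=n_A)
X_A = np.column_stack([np.ones(n_A), x])  # first component is the intercept
v_A = x                                   # assumed variance function v(x) = x
y_A = 1.0 + 2.0 * x + rng.normal(scale=np.sqrt(v_A))

beta_hat = fit_wls(X_A, y_A, v_A)
score_at_solution = quasi_score(beta_hat, X_A, y_A, v_A)
```

For a nonlinear mean function such as the logistic model, the same score would typically be solved iteratively (for example by Newton-Raphson) rather than in closed form.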
3.2 Two general forms of prediction estimators
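There are two commonly used model-based prediction estimators for $\mu_y$ in the presence of complete auxiliary information $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$; see Chapter 5 of Wu and Thompson (2020). Note that $E_\xi(\mu_y) = N^{-1} \sum_{i=1}^{N} m_i$ under the model (3.1). The two prediction estimators are constructed as

$$\hat{\mu}_{y1} = \frac{1}{N} \sum_{i=1}^{N} \hat{m}_i \qquad \text{and} \qquad \hat{\mu}_{y2} = \frac{1}{N} \Big\{ \sum_{i \in \mathcal{S}_A} y_i - \sum_{i \in \mathcal{S}_A} \hat{m}_i + \sum_{i=1}^{N} \hat{m}_i \Big\}. \tag{3.3}$$

The estimator $\hat{\mu}_{y2}$ is built based on $\mu_y = N^{-1} \{ \sum_{i \in \mathcal{S}_A} y_i + \sum_{i \notin \mathcal{S}_A} y_i \}$ and uses $\sum_{i \notin \mathcal{S}_A} \hat{m}_i$ to predict the unobserved term $\sum_{i \notin \mathcal{S}_A} y_i$. Under a linear regression model where $m(\mathbf{x}, \boldsymbol{\beta}) = \mathbf{x}^{\top} \boldsymbol{\beta}$, the two estimators given in (3.3) reduce to

$$\hat{\mu}_{y1} = \boldsymbol{\mu}_x^{\top} \hat{\boldsymbol{\beta}} \qquad \text{and} \qquad \hat{\mu}_{y2} = \frac{n_A}{N} \big( \bar{y}_A - \bar{\mathbf{x}}_A^{\top} \hat{\boldsymbol{\beta}} \big) + \boldsymbol{\mu}_x^{\top} \hat{\boldsymbol{\beta}}, \tag{3.4}$$

where $\boldsymbol{\mu}_x$ is the vector of the population means of the $\mathbf{x}$ variables, $\bar{\mathbf{x}}_A$ is the vector of the simple sample means of $\mathbf{x}$ from the non-probability sample $\mathcal{S}_A$, and $n_A$ is the size of $\mathcal{S}_A$. If the linear regression model contains an intercept and $\hat{\boldsymbol{\beta}}$ is the ordinary least squares estimator, we have $\hat{\mu}_{y1} = \hat{\mu}_{y2}$ since $\bar{y}_A - \bar{\mathbf{x}}_A^{\top} \hat{\boldsymbol{\beta}} = 0$ due to the zero sum of fitted residuals. The prediction estimators in (3.4) under a linear model only require the population means $\boldsymbol{\mu}_x$ in addition to the non-probability sample $\mathcal{S}_A$.

Under the setting described in Section 2 with auxiliary information on $\mathbf{x}$ provided through a reference probability sample $\mathcal{S}_B$ with survey weights $d_i^B$, we simply replace $\sum_{i=1}^{N} \hat{m}_i$ by $\sum_{i \in \mathcal{S}_B} d_i^B \hat{m}_i$ for the estimators in (3.3) and substitute $\boldsymbol{\mu}_x$ by $\hat{\boldsymbol{\mu}}_x$ for the estimators in (3.4), where $\hat{\boldsymbol{\mu}}_x = (\hat{N}^B)^{-1} \sum_{i \in \mathcal{S}_B} d_i^B \mathbf{x}_i$ and $\hat{N}^B = \sum_{i \in \mathcal{S}_B} d_i^B$. The population size $N$ appearing in (3.3) or (3.4) should also be replaced by $\hat{N}^B$ even if it is known.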
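The two prediction estimators and their linear-model forms can be checked numerically. The sketch below assumes a simulated finite population, a linear working model fitted by ordinary least squares, an inclusion mechanism for the non-probability sample that depends on $x$ only, and a simple random sample as the reference probability sample; all of these are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite population with complete auxiliary information.
N = 5000
x_pop = rng.normal(2.0, 1.0, size=N)
X_pop = np.column_stack([np.ones(N), x_pop])  # intercept first
y_pop = 1.0 + 2.0 * x_pop + rng.normal(size=N)

# Non-probability sample S_A: inclusion depends on x only (assumed mechanism).
p_incl = 1.0 / (1.0 + np.exp(-(-2.0 + 0.5 * x_pop)))
in_A = rng.random(N) < p_incl
X_A, y_A = X_pop[in_A], y_pop[in_A]
n_A = X_A.shape[0]

# OLS fit on S_A and fitted means for every population unit.
beta_hat = np.linalg.lstsq(X_A, y_A, rcond=None)[0]
m_hat_pop = X_pop @ beta_hat

# General forms: average of fitted means, and observed y plus predictions.
mu_y1 = m_hat_pop.mean()
mu_y2 = (y_A.sum() - (X_A @ beta_hat).sum() + m_hat_pop.sum()) / N

# Linear-model forms, which need only the population means of x.
mu_x = X_pop.mean(axis=0)
xbar_A = X_A.mean(axis=0)
mu_y1_lin = mu_x @ beta_hat
mu_y2_lin = (n_A / N) * (y_A.mean() - xbar_A @ beta_hat) + mu_x @ beta_hat

# Reference probability sample S_B: simple random sample with weights N / n_B.
n_B = 500
idx_B = rng.choice(N, size=n_B, replace=False)
d_B = np.full(n_B, N / n_B)
N_hat_B = d_B.sum()
mu_y1_B = (d_B * (X_pop[idx_B] @ beta_hat)).sum() / N_hat_B
```

Because the working model contains an intercept and is fitted by ordinary least squares, the fitted residuals over the sample sum to zero, so `mu_y1` and `mu_y2` agree up to floating-point error.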
3.3 Mass imputation
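Model-based prediction estimators of $\mu_y$ using a non-probability survey sample on $(y, \mathbf{x})$ and a reference probability survey sample on $\mathbf{x}$ have traditionally been presented as the mass imputation estimator. The study variable $y$ is not observed for any units in the reference survey sample $\mathcal{S}_B$, and hence can be viewed as missing for all $i \in \mathcal{S}_B$. Let $y_i^*$ be an imputed value for $y_i$, $i \in \mathcal{S}_B$. The mass imputation estimator of $\mu_y$ is then constructed as

$$\hat{\mu}_{y\,\mathrm{MI}} = \frac{1}{\hat{N}^B} \sum_{i \in \mathcal{S}_B} d_i^B y_i^*, \tag{3.5}$$

where $\hat{N}^B = \sum_{i \in \mathcal{S}_B} d_i^B$ is defined as before and the subscript “MI” indicates “Mass Imputation” (not “Multiple Imputation”). Under the deterministic regression imputation where $y_i^* = m(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$, the estimator $\hat{\mu}_{y\,\mathrm{MI}}$ reduces to the model-based prediction estimator $\hat{\mu}_{y1}$ as discussed in Section 3.2.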
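A small worked example of the mass imputation estimator under deterministic regression imputation is sketched below. It assumes a linear mean function fitted by ordinary least squares; the toy data and the function name `mass_imputation_mean` are hypothetical and chosen so that the arithmetic can be followed by hand.

```python
import numpy as np


def mass_imputation_mean(y_star, d_B):
    """Weighted mean of imputed values over the reference sample S_B."""
    N_hat_B = d_B.sum()
    return (d_B * y_star).sum() / N_hat_B


# Toy non-probability sample S_A lying exactly on the line y = 1 + 2x.
x_A = np.array([0.0, 1.0, 2.0, 3.0])
y_A = np.array([1.0, 3.0, 5.0, 7.0])
X_A = np.column_stack([np.ones_like(x_A), x_A])
beta_hat = np.linalg.lstsq(X_A, y_A, rcond=None)[0]  # approximately (1, 2)

# Toy reference probability sample S_B: x and survey weights only, no y.
x_B = np.array([1.0, 2.0, 4.0])
d_B = np.array([10.0, 20.0, 10.0])
X_B = np.column_stack([np.ones_like(x_B), x_B])

# Deterministic regression imputation: y_i* = m(x_i, beta_hat).
y_star = X_B @ beta_hat                # approximately (3, 5, 9)
mu_hat_MI = mass_imputation_mean(y_star, d_B)
```

With these numbers the estimate is $(10 \cdot 3 + 20 \cdot 5 + 10 \cdot 9)/40 = 5.5$, up to floating-point error.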
The mass imputation approach to analyzing non-probability survey samples has the same spirit as model-based prediction methods, but it opens the door to more flexible models and imputation techniques that have been developed in the existing literature on missing data problems. The approach was first examined by Rivers (2007) through the so-called sample matching method. For each $i \in \mathcal{S}_B$, the “missing” $y_i$ is imputed as $y_i^* = y_j$ for some $j \in \mathcal{S}_A$, where $j$ is a matching donor from $\mathcal{S}_A$ selected through the nearest neighbor method as measured by the distance between $\mathbf{x}_i$ and $\mathbf{x}_j$. The underlying model for the nearest neighbor imputation method is nonparametric, i.e., $E_\xi(y \mid \mathbf{x}) = m(\mathbf{x})$ for some unknown function $m(\cdot)$. The matching value $y_i^* = y_j$ can be viewed as the predicted value of the missing $y_i$ under the model. Theoretical properties of estimators based on nearest neighbor imputation were discussed by Chen and Shao (2000, 2001) for missing survey data problems.
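The sample matching idea can be sketched as follows, assuming Euclidean distance on $\mathbf{x}$ and a brute-force search for the nearest donor. The distance choice, the toy data and the function name `nearest_donor` are assumptions made for illustration; the source does not prescribe a particular distance or tie-breaking rule.

```python
import numpy as np


def nearest_donor(X_recipient, X_donor):
    """Index of the nearest donor row (squared Euclidean distance) per recipient."""
    diff = X_recipient[:, None, :] - X_donor[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    return dist2.argmin(axis=1)  # ties go to the first donor


# Toy non-probability sample S_A with (y, x) observed.
X_A = np.array([[0.0], [1.0], [2.0], [3.0]])
y_A = np.array([10.0, 20.0, 30.0, 40.0])

# Toy reference probability sample S_B with x and survey weights only.
X_B = np.array([[0.2], [2.9], [1.4]])
d_B = np.array([5.0, 5.0, 10.0])

# Each unit in S_B receives the y value of its nearest neighbor in S_A.
donor = nearest_donor(X_B, X_A)
y_star = y_A[donor]

# Weighted mean of the matched values over S_B.
mu_hat_match = (d_B * y_star).sum() / d_B.sum()
```

Here the donors are units 0, 3 and 1 of the non-probability sample, the imputed values are 10, 40 and 20, and the weighted mean is $(50 + 200 + 200)/20 = 22.5$.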
The semiparametric model (3.1) can be used for deterministic regression mass imputation. Under assumption A1, a consistent estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}_0$ is first obtained from the non-probability sample dataset $\{(y_i, \mathbf{x}_i), i \in \mathcal{S}_A\}$, and the estimator $\hat{\boldsymbol{\beta}}$ is then used to compute the imputed values $y_i^* = m(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$ for $i \in \mathcal{S}_B$. In other words, assumption A1 implies the so-called model transportability of Kim, Park, Chen and Wu (2021): the model which is built for the non-probability sample can be used for prediction with the reference probability sample. The resulting mass imputation estimator $\hat{\mu}_{y\,\mathrm{MI}}$ is identical to one of the model-based prediction estimators presented in Section 3.2. Asymptotic properties and variance estimation for the estimator $\hat{\mu}_{y\,\mathrm{MI}}$ using the semiparametric model (3.1) were discussed by Kim et al. (2021).
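Under the mass imputation approach, the only role played by the observed $y_i$ for $i \in \mathcal{S}_A$ is to estimate the model parameters $\boldsymbol{\beta}$. The estimator $\hat{\mu}_{y\,\mathrm{MI}}$ is constructed using the fitted model and auxiliary information from the reference probability sample $\mathcal{S}_B$. It seems that we did not fully use the information on the observed $y_i$ given that $\mu_y$ is the main parameter of interest. This led to the research question described in Chapter 17 of Wu and Thompson (2020) on “reverse sample matching”. The proposed estimator is constructed as

$$\hat{\mu}_{yA} = \frac{1}{\hat{N}^A} \sum_{i \in \mathcal{S}_A} \tilde{d}_i^B y_i$$

using all the observed $y_i$ in the non-probability sample, where $\hat{N}^A = \sum_{i \in \mathcal{S}_A} \tilde{d}_i^B$. The $\tilde{d}_i^B$ is a matched survey weight from $\mathcal{S}_B$ such that $\tilde{d}_i^B = d_j^B$, with $j \in \mathcal{S}_B$ being the nearest neighbor of $i \in \mathcal{S}_A$ as measured by $\| \mathbf{x}_i - \mathbf{x}_j \|$. Theoretical properties of the reverse matched estimator $\hat{\mu}_{yA}$ using the nearest neighbor to match $i \in \mathcal{S}_A$ with $j \in \mathcal{S}_B$ have not been formally investigated in the existing literature.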
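Reverse sample matching can be sketched in the same way as ordinary sample matching, with the roles of the two samples swapped: each unit in the non-probability sample borrows the survey weight of its nearest neighbor in the reference probability sample. The sketch assumes Euclidean distance on $\mathbf{x}$ and a ratio-type (weight-normalized) estimator; the toy data and the function name `nearest_neighbor` are hypothetical.

```python
import numpy as np


def nearest_neighbor(X_from, X_to):
    """For each row of X_from, index of the closest row of X_to."""
    diff = X_from[:, None, :] - X_to[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    return dist2.argmin(axis=1)


# Toy non-probability sample S_A with (y, x) observed.
X_A = np.array([[0.0], [1.0], [2.0], [3.0]])
y_A = np.array([10.0, 20.0, 30.0, 40.0])

# Toy reference probability sample S_B with x and survey weights.
X_B = np.array([[0.2], [2.9], [1.4]])
d_B = np.array([5.0, 5.0, 10.0])

# Matched weight for each i in S_A: the weight of its nearest j in S_B.
match = nearest_neighbor(X_A, X_B)
d_tilde = d_B[match]

# Ratio-type estimator using every observed y in S_A.
N_hat_A = d_tilde.sum()
mu_hat_rev = (d_tilde * y_A).sum() / N_hat_A
```

In this toy case the matched weights are 5, 10, 10 and 5, so the estimate is $(50 + 200 + 300 + 200)/30 = 25$. Note that a weight in the reference sample can be reused by several units, which is one reason a later calibration step over the matched weights may be attractive.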
Wang, Graubard, Katki and Li (2020) proposed a kernel weighting approach to reverse sample matching using $\tilde{d}_i^B = \sum_{j \in \mathcal{S}_B} k_{ij} d_j^B$, where $k_{ij}$ is a kernel distance between $\hat{\pi}_i^A$ and $\hat{\pi}_j^A$; see the adjusted logistic propensity (ALP) weighting method discussed at the end of Section 4.1.1 on the calculation of $\hat{\pi}_i^A$. They showed that the estimator $\hat{\mu}_{yA}$ is consistent under certain regularity conditions. In a recent working paper posted on arXiv by Liu and Valliant (2021), the authors discussed issues with the bias and the variance of the reverse matched estimator under different randomization frameworks involving one, two or all three of the sources $p$, $q$ and $\xi$. The authors also proposed a calibration step over the matched weights, which seems to be a promising idea. Further research on this topic is needed.
The mass imputation approach to analyzing non-probability survey samples leads to an interesting research question that is currently under investigation by a doctoral student at the University of Waterloo: Is it theoretically feasible and practically useful to create a mass-imputed dataset $\{(y_i^*, \mathbf{x}_i, d_i^B), i \in \mathcal{S}_B\}$ based on the reference probability survey sample that can be used for general statistical inferences? The answer clearly depends on the types of inferential problems to be conducted over the imputed dataset. A minimum requirement is that the conditional distribution of the study variable $y$ given the covariates $\mathbf{x}$ is preserved for the mass-imputed dataset. The nearest neighbor imputation method and the random regression imputation method can be useful for this purpose. Fractional imputation is another possibility, especially for binary or ordinal study variables. Multiple imputation is also potentially useful in this direction to create multiple mass-imputed datasets. The subscript “MI” in this case might need to be changed to “MI2”, meaning “Mass Imputation with Multiple Imputation”.
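To illustrate how a mass-imputed dataset might be built so that it retains some of the conditional variability of $y$ given $\mathbf{x}$, the sketch below uses one common variant of random regression imputation: the regression prediction plus a residual resampled from the non-probability sample fit. The linear model, the homoscedastic residual resampling and the simulated data are all assumptions for illustration; whether such a dataset supports valid general-purpose inference is exactly the open question raised above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical non-probability sample S_A with (y, x) observed.
n_A = 300
x_A = rng.uniform(0.0, 4.0, size=n_A)
X_A = np.column_stack([np.ones(n_A), x_A])
y_A = 1.0 + 2.0 * x_A + rng.normal(size=n_A)

# Hypothetical reference probability sample S_B: x and weights only.
n_B = 100
x_B = rng.uniform(0.0, 4.0, size=n_B)
X_B = np.column_stack([np.ones(n_B), x_B])
d_B = np.full(n_B, 50.0)

# Fit the working model on S_A and keep its residuals.
beta_hat = np.linalg.lstsq(X_A, y_A, rcond=None)[0]
resid_A = y_A - X_A @ beta_hat

# Random regression imputation: prediction plus a resampled residual.
e_star = rng.choice(resid_A, size=n_B, replace=True)
y_star = X_B @ beta_hat + e_star

# Mass-imputed dataset with columns (y*, x, d^B), one row per unit in S_B.
mass_imputed = np.column_stack([y_star, x_B, d_B])
mu_hat = (d_B * y_star).sum() / d_B.sum()
```

Unlike deterministic regression imputation, the imputed values here scatter around the fitted line, so summaries such as quantiles of $y$ are not artificially compressed. Repeating the residual draw several times would give multiple mass-imputed datasets in the spirit of multiple imputation.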
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa