Strategies for subsampling nonrespondents for economic programs
Section 2. Methodology
2.1 Survey design and estimation
The general framework for our research is the two-phase
sample design shown in Figure 2.1. The first stage is a stratified
probability sample of total size $n$, selected
from a finite population (frame) of size $N$
before data collection begins. The survey is conducted, and units either respond or do
not. During data collection, response rates are monitored in
domains, which do not necessarily
coincide with the sampling strata. For example, total response rates might be monitored
by three-digit industry classification even though the industry sampling strata
are further broken down by size class. Furthermore, the domains could be
independent of the original sampling strata, e.g., race or sex categories
(resembling post-strata). Hereafter, the term “domain” refers to the
nonrespondent subsampling strata, indexed by $h$.
The second stage of probability sampling occurs at a
predetermined point in the data collection cycle, when we select an overall
subsample from the first-stage
nonrespondents (a two-phase sample); this
predetermined point can be a fixed calendar date or can be set by a responsive/adaptive
design protocol. The overall subsample size
is determined by the program managers, who
take into account the overall budget for nonresponse follow-up (NRFU, assumed fixed), mandated
performance measures (e.g., response rates, coefficient of variation
requirements), and other operational considerations such as length of
collection period and available resources. Our allocation procedure determines
the size of the systematic subsample drawn
from the
nonrespondents in each domain. Only the
subsampled
units receive NRFU.
Our objective is to estimate
the population total of a survey characteristic.
This estimate combines a component
estimated from the
first-stage sample respondents with a component
estimated from the
second-stage sample respondents (see Figure 2.1).
Nonresponse adjustments to the
responding subsampled units assume a missing
at random (MAR) response mechanism, with response treated as a Bernoulli sample (Särndal,
Swensson and Wretman, 1992, Chapter 15; Kott, 1994). We consider three different adjustment-to-sample
reweighting estimators of the total
(Kalton and Flores-Cervantes, 2003): the
double reweighted expansion (DE) estimator (Binder, Babyak, Brodeur, Hidiroglou
and Wisner, 2000; Shao and Thompson, 2009; Haziza, Thompson and Yung, 2010), a
separate ratio (SR) estimator that adjusts for unit nonresponse using a
covariate that is highly correlated with both response propensity and the
survey characteristic of interest (Shao and Thompson, 2009; Haziza et al.,
2010), and a combined ratio (CR) estimator (Binder et al., 2000). Formulae
are provided in the Appendix.
Description for Figure 2.1
Figure illustrating the two-phase sample design. First, a first-stage probability sample is drawn and initial contact strategies are put in place. Some units respond at this first stage; the nonrespondents are subsampled using a systematic design. The units in this second-stage probability sample receive the NRFU procedures. Among these sampled units, there are respondents and nonrespondents.
These estimators require a minimum of
in each domain and a minimum of
for variance estimation. These minimal
conditions may not hold for several reasons. During the early stages of NRFU
collection, an insufficient number of the subsampled units might respond in a
given domain. Alternatively, the allocation procedure could determine that no
subsampling is required in one or more domains. Lastly, the allocation
procedure could require 100-percent follow-up (all units subsampled) in selected domains; henceforth, we refer to
100-percent follow-up/no subsampling as “full follow-up”. In these cases, the
estimation procedure ignores the last stage of sampling as if it did not occur
and produces estimates for the affected domain
using the collapsed estimator formulae
provided in the Appendix.
2.2 Allocation strategies
When all nonresponding cases are subjected to NRFU, respondent contact strategies focus
on improving overall response rates. Analysts might focus primarily on
obtaining responses from soft refusal cases that they believe have similar
characteristics to previous respondents (“quick wins”), although this
phenomenon is more likely when the survey collection is performed in the field,
as with household or agricultural surveys, and perhaps is less likely for
internet or mail collections. With business surveys, the size of the unit is a
factor in the NRFU procedures as discussed in Section 1.
Our objective is to obtain a realized set of respondents
that approximates a random subsample of the originally selected sample via a
probability sample of nonrespondents. With a probability sample, the targeted
cases represent a cross-section of the nonrespondent population. By focusing
contact efforts on the subsample, we hope to decrease the effects of
nonresponse bias on the estimated totals by obtaining data from all types of
nonresponding units. Moreover, weighting or imputation methods may be more
effective at reducing the nonresponse bias effects with a probability subsample
of nonrespondents (Brick, 2013). Even though they do not receive additional
NRFU, the unsampled nonrespondent cases may provide responses later in the
collection cycle. If so, an unbiased estimation procedure would not include the
unsampled late responses in the final estimate assuming that all subsampled
units respond, as these units are represented by the subsampled cases. However,
this procedure is extremely distasteful to many survey managers. Instead, we
include their data in the tabulations as if they had responded before subsampling. This does induce
bias in the estimate. In practice, we ensure that this situation occurs
infrequently by subsampling late in the data collection cycle.
With a business survey that keeps track of little or no
demographic information, most of the information on the nonrespondents such as
industry and unit size (e.g., total payroll, total receipts) is obtained from
the sampling frame. Sorting the nonrespondents within prespecified domains by
unit size and selecting a systematic sample should yield a subsample that
resembles the originally designed sample in terms of unit size composition.
This is especially important for business surveys where responses tend to be
obtained from the larger units (Thompson and Washington, 2013). The choice of
subsampling domain is determined by overall survey objectives such as
publication levels or by the adjustment cell design (e.g., weighting cells or
imputation classes), although computations are considerably simplified when the
domains of interest coincide with the original sampling strata. In the EC, the industry is
the domain of interest.
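The sort-then-systematic selection described above can be sketched in Python as follows. This is a minimal illustration, not the production procedure: the function name `systematic_subsample`, the `(unit_id, size)` tuple layout, the random-start rule, and the example data are all assumptions made here. A real-valued interval is handled by taking the integer part of each selection point.

```python
import random


def systematic_subsample(units, interval, rng=None):
    """Select a systematic 1-in-`interval` subsample after sorting by size.

    units    -- list of (unit_id, size) tuples for one domain's nonrespondents
    interval -- real-valued sampling interval (assumed >= 1)
    rng      -- optional random.Random instance, for reproducibility
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    rng = rng or random.Random()
    # Sort by the frame size measure so the subsample spans the size range.
    ordered = sorted(units, key=lambda unit: unit[1])
    # Random start in [0, interval), then step through the sorted list.
    start = rng.random() * interval
    selected = []
    k = 0
    while start + k * interval < len(ordered):
        selected.append(ordered[int(start + k * interval)])
        k += 1
    return selected


# Hypothetical domain: ten nonrespondents with a frame size measure.
domain = [(f"u{i}", size) for i, size in
          enumerate([50, 3, 18, 7, 120, 9, 31, 2, 64, 12])]
subsample = systematic_subsample(domain, interval=2.5, rng=random.Random(1))
```

Because the list is sorted before selection, the selected units are spread across the size distribution instead of clustering among the largest units.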
We consider two allocation approaches: (1)
equal-probability sampling; and (2) optimized allocation with constraints on
unit response rates and sample size in predetermined domains. Equal probability
sampling is easy to implement and should have the lowest sampling variance
among considered nonrespondent subsampling allocation strategies, since the
subsampling weight adjustment will be a constant value in all domains. However,
since the same proportion of nonrespondents is sampled in each domain, the
subsample may not be large enough to offset nonresponse bias effects on totals
in low-responding domains. We refer to the allocations obtained by equal
probability sampling as
where $K$
refers to the overall sampling interval.
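As a small illustration of equal-probability allocation, the sketch below divides each domain's nonrespondent count by the overall interval. The function name, the domain labels, and the counts are hypothetical, and the subsample sizes are left as real values instead of being rounded.

```python
def equal_probability_allocation(nonrespondent_counts, overall_interval):
    """Apply the same 1-in-K subsampling rate to every domain.

    nonrespondent_counts -- dict mapping domain label to number of nonrespondents
    overall_interval     -- overall sampling interval K (assumed > 0)
    Returns a dict of (possibly non-integer) domain subsample sizes.
    """
    if overall_interval <= 0:
        raise ValueError("overall_interval must be positive")
    return {domain: count / overall_interval
            for domain, count in nonrespondent_counts.items()}


# Hypothetical nonrespondent counts in two domains, with K = 4.
counts = {"A": 40, "B": 120}
allocation = equal_probability_allocation(counts, overall_interval=4)
# The subsampling weight adjustment is the same constant in every domain.
weights = {domain: counts[domain] / allocation[domain] for domain in counts}
```

The constant weight is what keeps the added sampling variance low. The cost is that a low-responding domain receives no more follow-up, proportionally, than a high-responding one.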
Our optimized allocation methods address the above
concern by concentrating NRFU efforts in domains that have low response rates,
attempting to select sufficient cases to achieve the performance benchmarks.
This strategy may decrease the nonresponse bias in the totals if the response
mechanism is MAR, conditional on the auxiliary variables used to define the
domains; see Wagner (2012). However, it can increase the variance, as the
subsampling intervals will differ and the weights will become more variable. To
minimize the additional sampling variance caused by differing sampling
intervals, the domain nonrespondent subsampling intervals
should be close to the overall nonrespondent subsampling interval. To control
costs, the allocation should not select more units for NRFU than budgeted.
Recall that the federal survey environment requires that target response rates
be achieved or nearly achieved, which makes all domains “equally” important
from a data collection viewpoint.
To describe the allocation procedures, we introduce
additional notation:
Unit response rate:
Target response rate:
Target domain response rate:
with
units of the
originally sampled units responding before subsampling, leaving
units available for subsampling in each domain.
The unit response rate (URR) is the actual proportion of responding sampled units (Thompson and Oliver, 2012) and does not include an adjustment for
subsampling. The target response rate
used for
allocation is the expected maximum obtainable URR for a given overall
subsampling rate.
It depends on
the conditional probability that a unit in a given domain
ultimately responds to the census/survey,
given that the unit did not respond prior to
subsampling. In the allocation procedure,
this conditional probability can be modeled from historical data if
available or can be assumed constant for a new survey or for sensitivity
analyses.
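As a sketch of how these rates fit together, the display below writes them out under assumed notation. The symbols are chosen here for illustration and are not taken from the text: $n_h$ is the number of originally sampled units in domain $h$, $r_h$ the number responding before subsampling, $m_h$ the number of nonrespondents subsampled, $p_h$ the conditional response probability, and $K$ the overall sampling interval. The exact forms, in particular that of the overall target rate, are one plausible reading of the verbal definitions.

```latex
\begin{align*}
\text{URR}_h \text{ (before subsampling)} &= \frac{r_h}{n_h}, \\
\text{TRR} &= \frac{\sum_h \left[ r_h + p_h (n_h - r_h)/K \right]}{\sum_h n_h}, \\
\text{TRR}_h &= \frac{r_h + p_h m_h}{n_h}.
\end{align*}
```

Under this reading, subsampling every domain at the overall rate gives $m_h = (n_h - r_h)/K$, and TRR is then the $n_h$-weighted average of the domain rates $\text{TRR}_h$.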
We formulate optimized allocation as a quadratic program
and consider two different objective functions. The first quadratic program minimizes
the squared deviation of the target response rate in each domain
from the overall target unit response rate,
subject to linear constraints on the size of the
nonrespondent sample. This objective function is analogous to the numerator of
the Pearson chi-square goodness-of-fit test statistic.
The second quadratic program minimizes the squared
deviation in domain sampling intervals from the overall sampling interval
subject to linear constraints on the unit
response rates in each domain and on the number of sampled nonrespondents.
Thus, although the optimization procedure allows the sampling intervals to vary
by domain, the program tries to avoid potentially large increases in variance
caused by the deliberately introduced “disproportionate sampling fractions”
referred to in Kish (1992). We refer to the allocations obtained from these
quadratic programs as
and
respectively.
Both quadratic programs are primarily deterministic.
However, recall that at the allocation stage, we must estimate the number of
subsampled units that will eventually respond in each domain. Both quadratic
programs use Constraints (1) through (3) in Table 2.1. Constraint (4) is
included in the second quadratic program
to ensure that the optimization
solution is not simply
the overall sampling interval in every domain.
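The two programs can be written compactly under assumed notation; none of the symbols below is taken from the text. For domain $h = 1, \dots, H$, let $n_h$ be the original sample size, $r_h$ the respondents before subsampling, $m_h$ the subsample size (the decision variable), $p_h$ the conditional response probability, $K_h = (n_h - r_h)/m_h$ the domain sampling interval, $K$ the overall interval, $T_h(m_h) = (r_h + p_h m_h)/n_h$ the expected domain URR, and $T$ the overall target URR. This is a sketch of one formulation consistent with the verbal description, not necessarily the exact programs that were solved.

```latex
\begin{align*}
\text{First program:}\quad
  & \min_{m_1,\dots,m_H} \sum_{h=1}^{H} \left( T_h(m_h) - T \right)^2
  && \text{subject to (1)--(3)} \\
\text{Second program:}\quad
  & \min_{m_1,\dots,m_H} \sum_{h=1}^{H} \left( K_h - K \right)^2
  && \text{subject to (1)--(4)} \\[4pt]
(1)\quad & \sum_{h=1}^{H} m_h \le \frac{1}{K} \sum_{h=1}^{H} (n_h - r_h) \\
(2)\quad & m_h \le n_h - r_h \\
(3)\quad & m_h \ge 0 \\
(4)\quad & T_h(m_h) \ge T
\end{align*}
```

Since $T_h$ is linear in $m_h$, the first objective is quadratic in the subsample sizes and all four constraints are linear. The second objective is quadratic in the intervals $K_h$ but not in $m_h$; which variable a solver works with is left open here.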
There are two limiting scenarios
(preconditions) that are addressed before the
optimization. First, domains whose
unit response rate already meets the target
before subsampling must be removed from the optimization problem.
Second, if the target unit response rate
cannot possibly be achieved in a given domain for the assumed
conditional response probability,
then all nonrespondents in the domain are selected for
NRFU.
The
optimization is applied to the remaining
domains, requiring that these subsampled domains have expected URRs that meet
or exceed the target URRs.
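The precondition screening can be sketched in Python as below. The function name, the three labels, the `(n_h, r_h, p_h)` tuple layout, and the formula used for the best-case URR (every nonrespondent followed up, each responding with probability `p_h`) are assumptions made for illustration.

```python
def classify_domains(domains, target_urr):
    """Screen domains before optimization.

    domains    -- dict mapping domain label to (n_h, r_h, p_h):
                  sampled units, respondents before subsampling, and the
                  assumed conditional response probability of nonrespondents
    target_urr -- target unit response rate, between 0 and 1
    Returns a dict mapping domain label to one of
    "no_subsampling", "full_follow_up", or "optimize".
    """
    labels = {}
    for name, (n_h, r_h, p_h) in domains.items():
        current_urr = r_h / n_h
        # Expected URR if every nonrespondent in the domain received NRFU.
        best_case_urr = (r_h + p_h * (n_h - r_h)) / n_h
        if current_urr >= target_urr:
            labels[name] = "no_subsampling"   # target already met
        elif best_case_urr < target_urr:
            labels[name] = "full_follow_up"   # target unreachable
        else:
            labels[name] = "optimize"         # left to the quadratic program
    return labels


# Hypothetical domains: (sampled units, early respondents, response probability).
example = {"A": (100, 90, 0.5), "B": (100, 20, 0.5), "C": (100, 70, 0.5)}
labels = classify_domains(example, target_urr=0.8)
```

In this example, domain A already exceeds the target, domain B cannot reach it even with full follow-up, and only domain C is passed to the optimization.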
Using sample data containing respondents and
nonrespondents, along with different specified values for
the allocation parameters, we used SAS® PROC
NLP to solve the quadratic programs and obtain the domain allocations.
The realized allocations are not integer
values, and the real-valued domain sampling intervals
were input to SAS® PROC
SURVEYSELECT to select stratified systematic subsamples of nonrespondents. As
noted by one reviewer, this yields a solution that is randomly rounded but
constrained at the overall required sample size, and there may be some impact
on reliability due to rounding error. Such effects were not studied in this
paper. (The data analysis for this paper was generated using SAS software.
Copyright, SAS Institute Inc. SAS and all other SAS Institute Inc. product or
service names are registered trademarks or trademarks of SAS Institute Inc.,
Cary, NC, USA.)
Table 2.1
Optimized allocation quadratic programs
| | First quadratic program | Second quadratic program | Purpose |
| Objective function | Minimize the squared deviation of the domain target response rates from the overall target unit response rate | Minimize the squared deviation of the domain sampling intervals from the overall sampling interval | |
| Constraint (1) | Applies | Applies | Selected sample size cannot exceed overall 1-in-K sample size |
| Constraint (2) | Applies | Applies | Domain subsample cannot exceed number of nonrespondents in the strata |
| Constraint (3) | Applies | Applies | Non-negativity constraint |
| Constraint (4) | Not applicable | Applies | Ensures that all domains achieve target URR as feasible |
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2018
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa