Strategies for subsampling nonrespondents for economic programs
Section 2. Methodology

2.1 Survey design and estimation

The general framework for our research is the two-phase sample design shown in Figure 2.1. The first stage is a stratified probability sample with a total sample size of $n$ from a finite population (frame) of size $N$, selected before data collection begins. The survey is conducted, and units either respond or do not. During data collection, response rates are monitored in $H$ domains, where the domains do not necessarily equal the sampling strata. For example, total response rates might be monitored by three-digit industry classification, although these industry sampling strata are further broken down by size class. Furthermore, the domains could be independent of the original sampling strata, e.g., race or sex categories (resembling post-strata). Hereafter, the term “domain” refers to the nonrespondent subsampling strata, indexed by $h$ $(h = 1, 2, \dots, H)$.

The second stage of probability sampling occurs at a predetermined point in the data collection cycle, when we select an overall 1-in-$K$ subsample of size $m_{1}$ from the $m$ nonrespondents (a two-phase sample); this predetermined point can be a fixed calendar date or can be set by a responsive/adaptive design protocol. The value of $K$ is determined by the program managers, who take into account the overall budget for NRFU (assumed fixed), mandated performance measures (e.g., response rates, coefficient of variation requirements), and other operational considerations such as length of collection period and available resources. Our allocation procedure determines the 1-in-$K_{h}$ systematic subsample of size $m_{1 h}$ from the $m_{h}$ nonrespondents in each domain. Only the sampled $m_{1 h}$ units receive NRFU.

Our objective is to estimate $Y$, the population total of characteristic $y$. This estimate is $\hat{Y} = {\hat{Y}}_{R 1} + {\hat{Y}}_{R 2}$, where ${\hat{Y}}_{R 1}$ is estimated from the $r_{1 h}$ first-stage sample respondents and ${\hat{Y}}_{R 2}$ is estimated from the $r_{2 h}$ second-stage sample respondents (see Figure 2.1). Nonresponse adjustments to the $r_{2 h}$ subsampled (responding) units assume a missing at random (MAR) response mechanism, with response treated as a Bernoulli sample (Särndal, Swensson and Wretman, 1992, Chapter 15; Kott, 1994). We consider three different adjustment-to-sample reweighting estimators of ${\hat{Y}}_{R 2}$ (Kalton and Flores-Cervantes, 2003): the double reweighted expansion (DE) estimator (Binder, Babyak, Brodeur, Hidiroglou and Wisner, 2000; Shao and Thompson, 2009; Haziza, Thompson and Yung, 2010), a separate ratio (SR) estimator that adjusts for unit nonresponse using a covariate that is highly correlated with both response propensity and the survey characteristic of interest (Shao and Thompson, 2009; Haziza et al., 2010), and a combined ratio (CR) estimator (Binder et al., 2000). Formulae are provided in the Appendix.
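As a rough illustration of how the two components combine, the following Python sketch computes $\hat{Y} = {\hat{Y}}_{R 1} + {\hat{Y}}_{R 2}$ for a single domain using a double-expansion style adjustment. The weight form used here (design weight times $m_{h} / m_{1 h}$ times $m_{1 h} / r_{2 h}$) is an assumption made for illustration; the formulae actually used are those in the Appendix and may differ. The function name and arguments are hypothetical.

```python
def domain_total_de(y_r1, w_r1, y_r2, w_r2, m_h, m_1h):
    """Estimate one domain's total as Y_R1 + Y_R2 (illustrative DE-style form).

    y_r1, w_r1: values and design weights of the r_1h first-stage respondents.
    y_r2, w_r2: values and design weights of the r_2h subsample respondents.
    m_h:  number of nonrespondents at the subsampling point.
    m_1h: number of nonrespondents selected for follow-up.
    """
    r_2h = len(y_r2)
    if r_2h < 1:
        raise ValueError("need at least one subsample respondent (r_2h >= 1)")
    # First-stage respondents carry only their design weights.
    y_hat_r1 = sum(w * y for w, y in zip(w_r1, y_r1))
    # Assumed adjustments: expand for subsampling, then for nonresponse
    # within the subsample (response treated as Bernoulli within the domain).
    subsample_adj = m_h / m_1h
    nonresponse_adj = m_1h / r_2h
    y_hat_r2 = sum(w * subsample_adj * nonresponse_adj * y
                   for w, y in zip(w_r2, y_r2))
    return y_hat_r1 + y_hat_r2


# Two early respondents, eight nonrespondents, four followed up, two respond.
total = domain_total_de([10, 20], [5, 5], [8, 12], [5, 5], m_h=8, m_1h=4)
```

In this toy case the two subsample respondents stand in for all eight nonrespondents, so each carries four times its design weight.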

Figure 2.1 Nonrespondent subsample from probability sample, selected during data collection (two-phase sample design). Unsampled nonrespondents do not receive NRFU.

Description for Figure 2.1

Figure illustrating the two-phase sample design. First, a first-stage probability sample of size $n_{h}$ is drawn, and initial contact strategies are put in place. There are $r_{1 h}$ respondents to this first stage and $m_{h}$ nonrespondents, who are subsampled using a systematic design. The size of this second-stage probability sample is $m_{1 h}$. These units receive the NRFU procedures. Among these sampled units, there are $r_{2 h}$ respondents and $m_{2 h}$ nonrespondents.

These estimators require a minimum of $r_{2 h} = 1$ in each domain and a minimum of $r_{2 h} = 2$ for variance estimation. These minimal conditions may not hold for several reasons. During the early stages of NRFU collection, an insufficient number of the subsampled units might respond in a given domain. Alternatively, the allocation procedure could determine that no subsampling is required in one or more domains. Lastly, the allocation procedure could require 100-percent follow-up (all units subsampled) in selected domains; henceforth, we refer to 100-percent follow-up/no subsampling as “full follow-up”. In these cases, the estimation procedure ignores the last stage of sampling as if it did not occur and produces estimates for domain $h$ using the collapsed estimator formulae provided in the Appendix.

2.2 Allocation strategies

When all nonresponding cases are subjected to NRFU, respondent contact strategies focus on improving overall response rates. Analysts might focus primarily on obtaining responses from soft refusal cases that they believe have similar characteristics to previous respondents (“quick wins”), although this phenomenon is more likely when the survey collection is performed in the field, as with household or agricultural surveys, and perhaps is less likely for internet or mail collections. With business surveys, the size of the unit is a factor in the NRFU procedures as discussed in Section 1.

Our objective is to obtain a realized set of respondents that approximates a random subsample of the originally selected sample via a probability sample of nonrespondents. With a probability sample, the targeted cases represent a cross-section of the nonrespondent population. By focusing contact efforts on the subsample, we hope to decrease the effects of nonresponse bias on the estimated totals by obtaining data from all types of nonresponding units. Moreover, weighting or imputation methods may be more effective at reducing the nonresponse bias effects with a probability subsample of nonrespondents (Brick, 2013). Even though they do not receive additional NRFU, the unsampled nonrespondent cases may provide responses later in the collection cycle. If so, an unbiased estimation procedure would not include the unsampled late responses in the final estimate assuming that all subsampled units respond, as these units are represented by the subsampled cases. However, this procedure is extremely distasteful to many survey managers. Instead, we include their data in the tabulations as if they had responded before subsampling. This does induce bias in the estimate. In practice, we ensure that this situation occurs infrequently by subsampling late in the data collection cycle.

With a business survey that keeps track of little or no demographic information, most of the information on the nonrespondents, such as industry and unit size (e.g., total payroll, total receipts), is obtained from the sampling frame. Sorting the nonrespondents within prespecified domains by unit size and selecting a systematic sample should yield a subsample that resembles the originally designed sample in terms of unit size composition. This is especially important for business surveys where responses tend to be obtained from the larger units (Thompson and Washington, 2013). The choice of subsampling domain is determined by overall survey objectives such as publication levels or by the adjustment cell design (e.g., weighting cells or imputation classes), although computations are considerably simplified when the domains of interest are the original sampling strata. In the EC, the industry is the domain of interest.
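The sort-then-systematic selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the production procedure: the record layout, the `payroll` size measure, and the function name are hypothetical, and a fractional interval is handled by taking a random start in $[0, K_{h})$ and truncating each selection point to a list position.

```python
import random


def systematic_subsample(units, k, size_key, rng=None):
    """Select a 1-in-k systematic sample from units sorted by size.

    units:    list of nonrespondent records in one domain.
    k:        sampling interval (>= 1); may be non-integer.
    size_key: function returning the frame measure of size for a unit.
    rng:      optional random.Random instance, for reproducibility.
    """
    if k < 1:
        raise ValueError("sampling interval must be at least 1")
    rng = rng or random.Random()
    ordered = sorted(units, key=size_key)   # implicit stratification by size
    start = rng.random() * k                # random start in [0, k)
    picks = []
    j = 0
    while start + j * k < len(ordered):
        picks.append(ordered[int(start + j * k)])
        j += 1
    return picks


nonrespondents = [{"id": i, "payroll": p} for i, p in
                  enumerate([50, 5, 400, 20, 90, 7, 300, 15, 60, 1000])]
subsample = systematic_subsample(nonrespondents, 3,
                                 lambda u: u["payroll"], random.Random(1))
```

Because the list is sorted before selection, the chosen units are spread across the size distribution. With a non-integer ratio $m_{h} / K_{h}$, the realized subsample size is one of the two integers adjacent to that ratio, depending on the random start.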

We consider two allocation approaches: (1) equal-probability sampling; and (2) optimized allocation with constraints on unit response rates and sample size in predetermined domains. Equal-probability sampling is easy to implement and should have the lowest sampling variance among the considered nonrespondent subsampling allocation strategies, since the subsampling weight adjustment will be a constant value in all domains. However, since the same proportion of nonrespondents is sampled in each domain, the subsample may not be large enough to offset nonresponse bias effects on totals in low-responding domains. We refer to the allocations obtained by equal-probability sampling as Constant-$K$, where $K$ refers to the overall sampling interval (1-in-$K$).

Our optimized allocation methods address the above concern by concentrating NRFU efforts in domains that have low response rates, attempting to select sufficient cases to achieve the performance benchmarks. This strategy may decrease the nonresponse bias in the totals if the response mechanism is MAR, conditional on the auxiliary variables used to define the domains; see Wagner (2012). However, it can increase the variance, as the subsampling intervals will differ and the weights will become more variable. To minimize the additional sampling variance caused by differing sampling intervals, the domain nonrespondent subsampling intervals should be close to the overall nonrespondent subsampling interval. To control costs, the allocation should not select more units for NRFU than budgeted. Recall that the federal survey environment requires that target response rates be achieved or nearly achieved, which makes all domains “equally” important from a data collection viewpoint.
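One way to gauge the variance penalty from differing intervals is Kish's approximate design effect for unequal weighting, $n \sum w^{2} / {(\sum w)}^{2}$. The Python sketch below is illustrative only and is not a computation from this paper: it considers the subsampling weights $K_{h}$ alone (ignoring design weights and nonresponse adjustments), and the function name and inputs are hypothetical.

```python
def kish_weight_effect(intervals, counts):
    """Kish's approximate design effect from unequal weights.

    intervals: subsampling interval K_h per domain (the subsampling weight).
    counts:    number of subsampled units carrying that weight in each domain.
    Returns n * sum(w^2) / (sum(w))^2, which equals 1 when all weights agree.
    """
    n = sum(counts)
    sum_w = sum(c * k for k, c in zip(intervals, counts))
    sum_w2 = sum(c * k * k for k, c in zip(intervals, counts))
    return n * sum_w2 / (sum_w ** 2)


constant_k = kish_weight_effect([2, 2, 2], [10, 20, 30])   # equal intervals
varying_k = kish_weight_effect([1, 2, 4], [20, 10, 5])     # unequal intervals
```

Equal intervals give a factor of exactly 1, while the unequal intervals in the second call inflate it above 1, which is the effect that keeping $K_{h}$ close to $K$ is meant to limit.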

To describe the allocation procedures, we introduce additional notation:

Unit response rate: $URR = \frac{\sum_{h} (r_{1 h} + r_{2 h})}{\sum_{h} n_{h}}$

Target response rate: ${URR}^{T} = \frac{\sum_{h} \left[ r_{1 h} + (q_{h} m_{h} / K) \right]}{\sum_{h} n_{h}}$

Target domain response rate: ${URR}_{h}^{T} = \frac{r_{1 h} + (q_{h} m_{h} / K_{h})}{n_{h}}$

with $r_{1 h}$ units of the $n_{h}$ originally sampled units responding before subsampling, leaving $m_{h}$ units available for subsampling in each domain. The unit response rate (URR) is the actual proportion of responding sampled units (Thompson and Oliver, 2012) and does not include an adjustment for subsampling. The target response rate $({URR}^{T})$ used for allocation is the expected maximum obtainable URR for a given overall subsampling interval $K$, with $q_{h}$ representing the conditional probability of ultimately responding to the census/survey in domain $h$, given that the unit did not respond prior to subsampling. In the allocation procedure, $q_{h}$ can be modeled from historical data if available or can be assumed constant for a new survey or for sensitivity analyses.
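The target rates defined above can be computed directly from the domain counts. The following Python sketch evaluates ${URR}^{T}$ and ${URR}_{h}^{T}$; the dictionary keys, the function name, and the example counts are hypothetical and chosen only for illustration.

```python
def target_response_rates(domains, k_overall, k_by_domain=None):
    """Compute URR^T and the per-domain URR_h^T from the definitions above.

    domains: list of dicts with keys n (sample size), r1 (respondents before
             subsampling), m (nonrespondents), q (assumed probability that a
             followed-up nonrespondent eventually responds).
    k_overall:   overall subsampling interval K.
    k_by_domain: optional list of K_h; defaults to K in every domain.
    """
    if k_by_domain is None:
        k_by_domain = [k_overall] * len(domains)
    total_n = sum(d["n"] for d in domains)
    # Overall target: expected respondents under a 1-in-K subsample.
    urr_t = sum(d["r1"] + d["q"] * d["m"] / k_overall for d in domains) / total_n
    # Domain targets: expected respondents under each domain's own interval.
    urr_h_t = [(d["r1"] + d["q"] * d["m"] / k_h) / d["n"]
               for d, k_h in zip(domains, k_by_domain)]
    return urr_t, urr_h_t


example = [
    {"n": 100, "r1": 80, "m": 20, "q": 0.5},   # high-responding domain
    {"n": 100, "r1": 40, "m": 60, "q": 0.5},   # low-responding domain
]
overall, by_domain = target_response_rates(example, k_overall=2)
```

With a constant interval of $K = 2$, the overall target is 0.70 but the two domains sit at 0.85 and 0.55, which shows how equal-probability subsampling can leave a low-responding domain well short of the overall target.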

We formulate optimized allocation as a quadratic program and consider two different objective functions. The first quadratic program minimizes the sum over domains of the squared deviations of the domain target response rate ${URR}_{h}^{T}$ from the overall target unit response rate ${URR}^{T}$, subject to linear constraints on the size of the nonrespondent sample. This objective function is analogous to the numerator of the Pearson chi-square goodness-of-fit test statistic.

The second quadratic program minimizes the sum of squared deviations of the domain sampling intervals from the overall sampling interval $(K)$, subject to linear constraints on the unit response rates in each domain and on the number of sampled nonrespondents. Thus, although the optimization procedure allows the sampling intervals to vary by domain, the program tries to avoid potentially large increases in variance caused by the deliberately introduced “disproportionate sampling fractions” referred to in Kish (1992). We refer to the allocations obtained from these quadratic programs as Min-URR and Min-$K$, respectively.

Both quadratic programs are primarily deterministic. However, recall that at the allocation stage, we must estimate the number of subsampled units that will eventually respond in each domain. Both quadratic programs use Constraints (1) through (3) in Table 2.1. Constraint (4) is included in the Min-$K$ allocation to ensure that the optimization solution is not $K_{h} = K$ for all domains $h$. Two limiting scenarios (preconditions) are addressed before the Min-$K$ optimization. First, domains whose unit response rate before subsampling already meets or exceeds the target $(r_{1 h} / n_{h} \geq {URR}^{T})$ are removed from the optimization problem $(K_{h} = \infty)$. Second, if the target unit response rate cannot possibly be achieved in a given domain for an assumed $q_{h}$, then all units in the domain are selected for NRFU $(K_{h} = 1)$. The Min-$K$ optimization is applied to the remaining domains, requiring that these subsampled domains have expected URRs that meet or exceed the target URRs.
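The two preconditions can be sketched as a screening step ahead of the optimization. The Python sketch below is a hypothetical illustration (the function name, dictionary keys, and status labels are not from the paper). For domains that stay in the program, it also reports the largest $K_{h}$ that satisfies ${URR}_{h}^{T} \geq {URR}^{T}$, obtained by rearranging the definition of ${URR}_{h}^{T}$; it does not solve the quadratic program itself.

```python
import math


def min_k_preconditions(domains, urr_target):
    """Classify domains before the Min-K optimization (Constraint (4)).

    Returns one (status, value) tuple per domain:
      "none"     -> already at target before subsampling (K_h = infinity)
      "full"     -> target unreachable even with full follow-up (K_h = 1)
      "optimize" -> stays in the program; value is the largest K_h that
                    still satisfies URR_h^T >= URR^T
    """
    out = []
    for d in domains:
        n, r1, m, q = d["n"], d["r1"], d["m"], d["q"]
        if r1 / n >= urr_target:
            out.append(("none", math.inf))
        elif (r1 + q * m) / n < urr_target:
            out.append(("full", 1.0))
        else:
            # (r1 + q*m/K_h)/n >= URR^T  <=>  K_h <= q*m / (n*URR^T - r1)
            out.append(("optimize", q * m / (n * urr_target - r1)))
    return out


screened = min_k_preconditions(
    [
        {"n": 100, "r1": 80, "m": 20, "q": 0.5},   # 0.80 already >= 0.70
        {"n": 100, "r1": 30, "m": 70, "q": 0.5},   # at best 0.65 < 0.70
        {"n": 100, "r1": 50, "m": 50, "q": 0.8},   # reachable with K_h <= 2
    ],
    urr_target=0.70,
)
```

In this toy input, the first domain needs no follow-up sample, the second receives full follow-up, and the third remains in the optimization with its interval capped at 2.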

Using sample data containing respondents and nonrespondents, along with different specified values for $q_{h}$, we use SAS® PROC NLP to solve the quadratic programs (obtaining the set of $K_{h}$). The realized allocations are not integer values, and the real-valued intervals $(K_{h})$ were input to SAS® PROC SURVEYSELECT to select stratified systematic subsamples of nonrespondents. As noted by one reviewer, this yields a solution that is randomly rounded but constrained at the overall required sample size, and there may be some impact on reliability due to rounding error. Such effects were not studied in this paper. (The data analysis for this paper was generated using SAS software. Copyright, SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.)

Table 2.1
Optimized allocation quadratic programs
		Min-URR	Min-$K$	Purpose
Objective function		$\min \sum_{h} {({URR}_{h}^{T} - {URR}^{T})}^{2}$	$\min \sum_{h} {(K_{h} - K)}^{2} = \min \sum_{h} {(\frac{m_{h}}{m_{1 h}} - K)}^{2}$	
Constraints	(1)	$K \leq \sum_{h} m_{h} / \sum_{h} m_{1 h}$	$K \leq \sum_{h} m_{h} / \sum_{h} m_{1 h}$	Selected sample size cannot exceed overall 1-in-$K$ sample size
	(2)	$m_{h} / m_{1 h} \geq 1$	$m_{h} / m_{1 h} \geq 1$	Domain subsample cannot exceed number of nonrespondents in the domain
	(3)	$m_{1 h} \geq 0$	$m_{1 h} \geq 0$	Non-negativity constraint
	(4)	Not applicable	$\begin{array}{ll} K_{h} = 1 & \text{if } \frac{r_{1 h} + q_{h} m_{h}}{n_{h}} < {URR}^{T} \\ K_{h} = \infty & \text{if } \frac{r_{1 h}}{n_{h}} \geq {URR}^{T} \\ {URR}_{h}^{T} \geq {URR}^{T} & \text{otherwise} \end{array}$	Ensures that all domains achieve the target URR where feasible

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2018-06-21
