Multiple-frame surveys for a multiple-data-source world
Section 3. Estimation in classical multiple-frame surveys


The main problem for inference in a classical multiple-frame survey, one that is designed to satisfy Assumptions (A1) to (A6), is how to account for potential overlap among the samples. In the NSAF, telephone households were screened out of the area sample, but in many applications screening is infeasible or it is more cost-effective to obtain data from the full sample selected from each frame. When separate surveys or data sources are not designed with data combination in mind, the overlap depends on the coverage of the individual data sources.

With an overlap design, units that are contained in more than one frame have multiple chances of being selected in the sample. An estimator constructed by summing the weighted observations from each of the $Q$ samples,

${\hat{Y}}_{concat} = \sum_{q =1}^{Q} \sum_{i \in S_{q}} w_{i}^{(q)} y_{i},$

will be a biased estimator of $Y = \sum_{i =1}^{N} y_{i}$ because the individual sample weights do not reflect the multiple chances of selection for units in overlap domains. Methods for estimating population totals thus typically multiply the survey weights $w_{i}^{(q)}$ by a multiplicity adjustment $m_{i}^{(q)}$ that satisfies $\sum_{q =1}^{Q} δ_{i}^{(q)} m_{i}^{(q)} \approx 1$ for each unit $i,$ resulting in the estimator

$\hat{Y} = \sum_{q =1}^{Q} \sum_{i \in S_{q}} w_{i}^{(q)} m_{i}^{(q)} y_{i} = \sum_{q =1}^{Q} \sum_{i \in S_{q}} {\tilde{w}}_{i}^{(q)} y_{i}, \qquad (3.1)$

where ${\tilde{w}}_{i}^{(q)} = w_{i}^{(q)} m_{i}^{(q)}$ is the multiplicity-adjusted weight.
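A minimal sketch of why the concatenated estimator overcounts and how the adjustment in (3.1) repairs it. It assumes a toy dual-frame population in which each sample is a census of its frame, so every weight equals 1; the function name `adjusted_total` and all data values are illustrative and do not come from the paper.

```python
import numpy as np

def adjusted_total(y, w, m):
    """One sample's contribution to (3.1): sum of w_i^(q) * m_i^(q) * y_i."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    m = np.asarray(m, dtype=float)
    return float(np.sum(w * m * y))

# Toy population of 5 units; delta[q, i] = 1 if unit i belongs to Frame q + 1.
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
delta = np.array([[1, 1, 1, 0, 0],    # Frame 1 holds units 0, 1, 2
                  [0, 0, 1, 1, 1]])   # Frame 2 holds units 2, 3, 4
true_total = float(y.sum())
n_frames = delta.sum(axis=0)          # number of frames containing each unit

y_concat = 0.0   # m = 1 everywhere: the overlap unit is counted twice
y_adj = 0.0      # m = 1 / (number of frames containing the unit)
for q in range(delta.shape[0]):
    s = np.flatnonzero(delta[q])      # census of Frame q + 1, so w = 1
    ones = np.ones(s.size)
    y_concat += adjusted_total(y[s], ones, ones)
    y_adj += adjusted_total(y[s], ones, 1.0 / n_frames[s])

print(true_total, y_concat, y_adj)
```

In this toy case the concatenated total exceeds the true total by exactly the value of the overlap unit, while the adjusted total recovers the true total.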

3.1 Hartley’s composite estimator

Hartley (1962) was the first author to present a rigorous theory of estimation in dual-frame surveys where units in the overlap domain {1, 2} might be sampled from both frames. This four-page paper made several important contributions. First, Hartley defined the problem in statistical terms. Second, he proposed an optimal estimator for combining the estimates from the two surveys. Third, he studied the design problem of allocating resources to the different samples, jointly choosing the allocation and the estimator that minimize the variance of the estimated population total subject to a fixed cost.

Hartley (1962) estimated the population total $Y = \sum_{i =1}^{N} y_{i}$ by

$\hat{Y} (θ) = {\hat{Y}}_{\{1\}}^{(1)} + {\hat{Y}}_{\{2\}}^{(2)} + θ {\hat{Y}}_{\{1, 2\}}^{(1)} + (1 - θ) {\hat{Y}}_{\{1, 2\}}^{(2)} . \qquad (3.2)$

He proposed choosing $θ$ to minimize $V [\hat{Y} (θ)] .$ This resulted in the value

$θ_{H} = \frac{V ({\hat{Y}}_{\{1, 2\}}^{(2)}) + Cov ({\hat{Y}}_{\{2\}}^{(2)}, {\hat{Y}}_{\{1, 2\}}^{(2)}) - Cov ({\hat{Y}}_{\{1\}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)})}{V ({\hat{Y}}_{\{1, 2\}}^{(1)}) + V ({\hat{Y}}_{\{1, 2\}}^{(2)})} . \qquad (3.3)$

The estimator in (3.2) is of the form in (3.1) with multiplicity weight adjustments

$m_{i}^{(1)} = δ_{i} (\{1\}) + δ_{i} (\{1, 2\}) θ, \qquad m_{i}^{(2)} = δ_{i} (\{2\}) + δ_{i} (\{1, 2\}) (1 - θ) .$

If it is desired to use the optimal compositing factor $θ_{H},$ estimators may be substituted for the unknown covariances in (3.3). Because $θ_{H}$ depends on covariances involving $y,$ however, the optimal multiplicity adjustment may differ for different variables, giving a different set of weights for each. In addition, $θ_{H}$ can be less than 0 or greater than 1, possibly resulting in negative weights for some observations. These features carry over to the $Q$ -frame generalization of Hartley’s optimal estimator studied by Lohr and Rao (2006).
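The following sketch evaluates (3.2) and (3.3) numerically and shows a case in which the optimal factor falls below 0. The function names and the variance and covariance inputs are hypothetical; in practice the inputs would be design-based estimates, and the variance formula assumes the two samples are independent.

```python
def hartley_theta(v12_f1, v12_f2, cov_f1, cov_f2):
    """Optimal compositing factor (3.3).

    v12_f1 = V(Yhat_{1,2}^(1)),  v12_f2 = V(Yhat_{1,2}^(2)),
    cov_f1 = Cov(Yhat_{1}^(1), Yhat_{1,2}^(1)),
    cov_f2 = Cov(Yhat_{2}^(2), Yhat_{1,2}^(2)).
    """
    return (v12_f2 + cov_f2 - cov_f1) / (v12_f1 + v12_f2)

def hartley_total(y1, y2, y12_f1, y12_f2, theta):
    """Composite estimator (3.2) from the four estimated domain totals."""
    return y1 + y2 + theta * y12_f1 + (1.0 - theta) * y12_f2

def composite_variance(theta, v1, v2, v12_f1, v12_f2, cov_f1, cov_f2):
    """V[Yhat(theta)] when the two samples are independent."""
    from_s1 = v1 + theta ** 2 * v12_f1 + 2.0 * theta * cov_f1
    from_s2 = v2 + (1.0 - theta) ** 2 * v12_f2 + 2.0 * (1.0 - theta) * cov_f2
    return from_s1 + from_s2

# Hypothetical inputs with a strong positive covariance within sample 1.
theta_h = hartley_theta(v12_f1=4.0, v12_f2=1.0, cov_f1=3.0, cov_f2=0.0)
v_at_opt = composite_variance(theta_h, 10.0, 10.0, 4.0, 1.0, 3.0, 0.0)
v_at_zero = composite_variance(0.0, 10.0, 10.0, 4.0, 1.0, 3.0, 0.0)
print(theta_h, v_at_opt, v_at_zero)
```

With these inputs the minimizing value is negative, so overlap-domain units sampled from Frame 1 would receive negative adjusted weights.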

The estimator in (3.2), with fixed value of $θ,$ is approximately unbiased for $Y$ under Assumption (A5). If the estimated domain totals and the estimates of the covariances in (3.3) are consistent, then the estimator with ${\hat{θ}}_{H}$ is consistent for $Y .$ Saegusa (2019) studied Hartley’s estimator from the perspective of empirical process theory, establishing a law of large numbers and a central limit theorem when $S_{1}$ and $S_{2}$ are both simple random samples.

Hartley’s application was in agriculture, and many of the early applications of dual-frame surveys were for agriculture or business surveys (Kott and Vogel, 1995), where list frames existed that contained the larger business or agricultural operations. A dual-frame survey with a disproportionately larger sample from the list frame reduced costs because (1) obtaining data from an operation in the list frame was often less expensive than obtaining data from an operation in the area frame and (2) oversampling the list frame was analogous to oversampling high-variance strata in stratified sampling and thus resulted in greater efficiency.

Later, as cellular telephones became more prevalent, concern about bias from using landline telephone samples alone led to use of dual-frame telephone surveys, with one sample from a landline frame and a second sample from a cellular telephone frame. Here, both frames are incomplete but together cover the population of persons with telephones. For these surveys, an important consideration is how to deal with persons having both kinds of telephones. The next section reviews choices for the compositing.

3.2 Multiplicity weighting adjustments

Hartley’s optimal estimator, with $θ_{H},$ uses a different set of weights for each response variable, which can lead to internal inconsistencies among estimators. Various authors have proposed estimators that use a single set of weights for all analyses. Here, I briefly list some of the multiplicity adjustment factors $m_{i}^{(q)}$ that result in one set of weights for the general estimator of the population total in (3.1). The resulting estimators are approximately unbiased for the population total $Y$ under Assumptions (A1), (A4), and (A5). These and additional estimators are reviewed in detail by Lohr (2011), Lu, Peng and Sahr (2013), Ferraz and Vogel (2015), Arcos, Rueda, Trujillo and Molina (2015), and Baffour, Haynes, Western, Pennay, Misson and Martinez (2016).

Screening estimator, with $m_{i}^{(1)} = 1, m_{i}^{(2)} = 1 - δ_{i}^{(1)}, \dots, m_{i}^{(Q)} = \prod_{q =1}^{Q - 1} (1 - δ_{i}^{(q)}) .$ A unit sampled from Frame $q$ is discarded if it is in any of Frames $1, \dots, q - 1.$ This estimator is automatically used with a screening design such as the NSAF; with an overlap design, its use means that some data observations are thrown away.
Multiplicity estimator, with $m_{i}^{(q)} = 1 / (\text{number of frames containing unit } i) = 1 / \sum_{f =1}^{Q} δ_{i}^{(f)} .$ In a dual-frame survey, this gives the estimator in (3.2) with $θ = 1 / 2 .$ Mecatti (2007) noted that with the multiplicity estimator, Assumption (A4) can be replaced by the slightly less restrictive assumption that $\sum_{f =1}^{Q} δ_{i}^{(f)}$ is known for each sampled unit $i .$
The multiplicity estimator can also be viewed as a special case of the generalized weight share method (Deville and Lavallée, 2006) using the standardized link matrix, since the number of links to population unit $i$ is the number of frames containing that unit.
Single-frame estimator (Bankier, 1986; Kalton and Anderson, 1986), which considers the observations as if they had been sampled from a single frame. If inverse probability weights are used, with $w_{i}^{(q)} = 1 / π_{i}^{(q)},$ then $m_{i}^{(q)} = π_{i}^{(q)} / \sum_{f =1}^{Q} δ_{i}^{(f)} π_{i}^{(f)} .$ This estimator requires that the inclusion probability for unit $i$ be known for all $Q$ frames, including frames from which the unit was not sampled. The multiplicity adjustments consider the inclusion probabilities for the designs but not the relative variances, which are affected by clustering and stratification in the individual samples.
Effective sample size (ESS) estimator (Chu, Brick and Kalton, 1999; O’Muircheartaigh and Pedlow, 2002), where the domain estimator from each frame is weighted by the relative effective sample size from that frame. Let $n^{(q)}$ be the sample size from Frame $q$ and let ${deff}^{(q)}$ denote the design effect for a key variable or a smoothed design effect for multiple variables. The effective sample size for $S_{q}$ is ${\tilde{n}}^{(q)} = n^{(q)} / {deff}^{(q)}$ and the multiplicity adjustment for unit $i$ is

$m_{i}^{(q)} = \frac{{\tilde{n}}^{(q)}}{\sum_{f =1}^{Q} δ_{i}^{(f)} {\tilde{n}}^{(f)}} .$

This estimator considers the relative variances of estimators from different samples and is often more efficient than the screening, multiplicity, and single-frame estimators.
The pseudo-maximum-likelihood (PML) estimator of Skinner and Rao (1996) is of this type when the frame sizes $N^{(q)}$ and domain sizes $N_{d}$ are unknown; Skinner and Rao (1996) recommended using the design effect for estimating $N_{\{1, 2\}}$ to establish the effective sample size for the dual-frame case. The PML estimator is asymptotically equivalent to an ESS estimator that poststratifies to the domain sizes $N_{d}$ when those are known; when the frame sizes $N^{(q)}$ are known but not $N_{\{1, 2\}},$ the PML estimator is asymptotically equivalent to calibrating the ESS estimator to estimated domain sizes calculated from the pseudo-likelihood function.
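The four fixed-weight adjustments listed above can be sketched in one function. This is an illustrative implementation under stated assumptions, not code from any of the cited papers: the function name, argument layout, frame-membership matrix, inclusion probabilities, and effective sample sizes are all hypothetical, and the PML estimator is not covered.

```python
import numpy as np

def multiplicity_factors(delta, q, method, pi=None, n_eff=None):
    """Return m_i^(q) for every row of delta, for frame index q (0-based).

    delta : (n, Q) 0/1 array; delta[i, f] = 1 if unit i belongs to frame f.
    pi    : (n, Q) inclusion probabilities under each design (single-frame).
    n_eff : length-Q effective sample sizes n^(f) / deff^(f) (ESS).
    """
    delta = np.asarray(delta, dtype=float)
    if method == "screening":
        # Keep the unit only if it belongs to none of the earlier frames.
        return np.prod(1.0 - delta[:, :q], axis=1)
    if method == "multiplicity":
        return 1.0 / delta.sum(axis=1)
    if method == "single_frame":
        pi = np.asarray(pi, dtype=float)
        return pi[:, q] / (delta * pi).sum(axis=1)
    if method == "ess":
        n_eff = np.asarray(n_eff, dtype=float)
        return n_eff[q] / (delta @ n_eff)
    raise ValueError(f"unknown method: {method}")

# Four units in different domains of a three-frame design.
delta = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 1, 1]])
pi = np.tile([0.10, 0.02, 0.05], (4, 1))      # hypothetical probabilities
n_eff = np.array([400.0, 150.0, 50.0])        # hypothetical effective sizes

# For each method, sum delta_i^(q) * m_i^(q) over frames; each entry should be 1.
coverage = {}
for method in ("screening", "multiplicity", "single_frame", "ess"):
    total = np.zeros(delta.shape[0])
    for q in range(delta.shape[1]):
        total += delta[:, q] * multiplicity_factors(delta, q, method, pi=pi, n_eff=n_eff)
    coverage[method] = total
print(coverage)
```

The check at the end confirms that each method satisfies the requirement that the adjustments sum to 1 over the frames containing a unit, which is what makes (3.1) approximately unbiased.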

Approximately unbiased estimates of the variances for all estimators considered in this section can be derived under Assumptions (A1) to (A6) and additional regularity conditions that ensure consistency of estimated totals and variance estimators from the $Q$ samples. Skinner and Rao (1996) studied linearization variance estimators; Chauvet (2016) derived linearization variance estimators for the French housing survey that accounted for the variance reduction due to high sampling fractions from some of the frames. Lohr and Rao (2000) developed theory for using the jackknife with multiple frames, and Lohr (2007) and Aidara (2019) considered bootstrap variance estimators. These methods rely on Assumption (A3) of independent samples; Chauvet and de Marsac (2014) considered the situation in which the samples share primary sampling units but independent samples are taken at the second stage of the design.

Calculating linearization variance estimates requires special software that implements the partial derivative calculations for the multiple frames. Replication variance estimation methods such as jackknife and bootstrap, however, can be calculated in standard survey software by creating a single data set that contains all the concatenated observations and weights ${\tilde{w}}_{i}^{(q)}$ from the $Q$ samples and creating replicate weights using standard methods for stratified multistage samples (Metcalf and Scott, 2009). The concatenated data set has $\sum_{q =1}^{Q} H_{q}$ strata, where $H_{q}$ is the number of strata for $S_{q};$ observations from different samples are in different strata. The replicate weight methods also can include effects of calibration (see Section 3.3) on the variance.
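As a rough illustration of the replication approach, the sketch below applies a generic delete-one-PSU stratified jackknife to a concatenated data set. It is not the specific procedure of Metcalf and Scott (2009); it assumes with-replacement sampling of PSUs within strata, no calibration, and stratum labels that already differ between samples. The function name and data are hypothetical.

```python
import numpy as np

def jackknife_variance(y, w_tilde, stratum, psu):
    """Delete-one-PSU jackknife variance of the total sum(w_tilde * y).

    stratum labels must be distinct across samples so that observations
    from different samples never share a stratum.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w_tilde, dtype=float)
    stratum = np.asarray(stratum)
    psu = np.asarray(psu)
    full = np.sum(w * y)
    var = 0.0
    for h in np.unique(stratum):
        in_h = stratum == h
        psus = np.unique(psu[in_h])
        n_h = psus.size
        if n_h < 2:
            raise ValueError(f"stratum {h} needs at least two PSUs")
        for j in psus:
            drop = in_h & (psu == j)
            rep_w = w.copy()
            rep_w[drop] = 0.0                          # delete PSU j
            rep_w[in_h & ~drop] *= n_h / (n_h - 1.0)   # reweight the rest of stratum h
            var += (n_h - 1.0) / n_h * (np.sum(rep_w * y) - full) ** 2
    return float(var)

# Concatenated data: three PSUs from sample 1, two PSUs from sample 2.
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0])
w_tilde = np.array([10.0, 10.0, 10.0, 5.0, 5.0])   # multiplicity-adjusted weights
stratum = np.array(["S1-h1", "S1-h1", "S1-h1", "S2-h1", "S2-h1"])
psu = np.array([0, 1, 2, 3, 4])
v_jk = jackknife_variance(y, w_tilde, stratum, psu)
print(v_jk)
```

For a total with one observation per PSU, this jackknife reproduces the usual with-replacement variance estimator, which gives a simple way to check the replicate-weight construction.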

Of course, many applications call for estimates of quantities other than population totals, and the multiple-frame theory applies to parameters that are smooth functions of domain totals. A different compositing factor may be desired when quantities other than population totals are of primary interest, however, and there may be special considerations for other types of analyses. Other types of statistical analyses that have been studied in the multiple-frame setting include linear (Lu, 2014b) and nonparametric (Lu, Fu and Zhang, 2021) regression, logistic regression with ordinal data (Rueda, Arcos, Molina and Ranalli, 2018), empirical distribution functions (Arcos, Martínez, Rueda and Martínez, 2017), gross flow estimation with missing data (Lu and Lohr, 2010), and chi-squared tests (Lu, 2014a).

Lu (2014b) noted that linear regression parameters estimated using the multiplicity-adjusted weights are the finite population regression coefficients $B$ that minimize the sum of squares $\sum_{i =1}^{N} {(y_{i} - x_{i}^{T} B)}^{2} .$ However, one of the reasons for taking a multiple-frame survey, rather than using an incomplete frame, is a concern that population characteristics may differ across domains. Lu (2014b) suggested examining the residuals separately by domain and also fitting separate regression models by domain to assess the appropriateness of the regression model.

3.3 Calibration

The PML estimator is calibrated to population counts that are known for the frames and domains. In a dual-frame survey where $N^{(1)}$ and $N^{(2)}$ are known, $\sum_{q =1}^{2} \sum_{i \in S_{q}} w_{i}^{(q)} m_{i, PML}^{(q)} δ_{i}^{(f)} = N^{(f)}$ for $f = 1, 2.$ If the overlap domain size $N_{\{1, 2\}}$ is also known, the PML estimator is calibrated to all three domain sizes. Skinner (1991) used calibration with the single-frame estimator, raking the estimator to the population frame counts.

Ranalli, Arcos, Rueda and Teodoro (2016) studied general calibration theory for dual-frame surveys. They assumed that a vector of auxiliary information $x$ is available with known population totals $X = \sum_{i =1}^{N} x_{i},$ and calculated multiple-frame generalized regression weights as

$c_{i}^{(q)} = {\tilde{w}}_{i}^{(q)} [1 + {(X - \hat{X})}^{T} {(\sum_{f =1}^{Q} \sum_{k \in S_{f}} α_{k} {\tilde{w}}_{k}^{(f)} x_{k} x_{k}^{T})}^{- 1} α_{i} x_{i}], \qquad (3.4)$

where $α_{k}$ is an arbitrary constant and $\hat{X} = \sum_{f =1}^{Q} \sum_{k \in S_{f}} {\tilde{w}}_{k}^{(f)} x_{k}$ estimates $X$ using the multiplicity-adjusted weights. Under regularity conditions, they showed that for the dual-frame estimator in (3.2) with fixed $θ,$ the variance of the generalized regression estimator ${\hat{Y}}_{GR} = \sum_{q =1}^{2} \sum_{i \in S_{q}} c_{i}^{(q)} y_{i}$ is approximated by

$V ({\hat{Y}}_{GR}) \approx V [\sum_{q =1}^{2} \sum_{i \in S_{q}} {\tilde{w}}_{i}^{(q)} (y_{i} - x_{i}^{T} B)], \qquad (3.5)$

where $B = {(\sum_{i =1}^{N} α_{i} x_{i} x_{i}^{T})}^{- 1} \sum_{i =1}^{N} α_{i} x_{i} y_{i} .$ The variance of the estimator depends on the residuals from the regression model just as in the single-frame case.
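The weights in (3.4) can be computed directly once the samples are concatenated. The sketch below is a minimal illustration, assuming the multiplicity-adjusted weights have already been formed and that $α_{k} = 1$ unless supplied; the function name `greg_weights`, the auxiliary variable, and the control totals are hypothetical.

```python
import numpy as np

def greg_weights(w_tilde, x, X_pop, alpha=None):
    """Generalized regression weights c_i from (3.4) for concatenated samples.

    w_tilde : (n,) multiplicity-adjusted weights.
    x       : (n, p) auxiliary vectors, one row per sampled unit.
    X_pop   : (p,) known population totals of x.
    alpha   : (n,) optional constants alpha_k (defaults to 1).
    """
    w = np.asarray(w_tilde, dtype=float)
    x = np.asarray(x, dtype=float)
    a = np.ones_like(w) if alpha is None else np.asarray(alpha, dtype=float)
    X_hat = x.T @ w                               # estimate of X with adjusted weights
    M = (x * (a * w)[:, None]).T @ x              # sum of alpha_k w_k x_k x_k^T
    lam = np.linalg.solve(M, np.asarray(X_pop, dtype=float) - X_hat)
    return w * (1.0 + a * (x @ lam))

# Six concatenated observations; x = (1, age) with hypothetical control totals.
w_tilde = np.array([50.0, 50.0, 25.0, 25.0, 40.0, 40.0])
x = np.array([[1, 20], [1, 35], [1, 50], [1, 65], [1, 30], [1, 45]], dtype=float)
X_pop = np.array([250.0, 10500.0])
c = greg_weights(w_tilde, x, X_pop)
print(c @ x)   # calibrated totals of x
```

Solving the linear system rather than inverting the matrix is a numerical convenience; because the matrix is symmetric, the result is the same as the expression written in (3.4).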

Särndal and Lundström (2005) distinguished among types of auxiliary information that can be used in calibration. InfoU is information available at the population level. A vector $x^{*}$ can be considered as InfoU if the population total $X^{*} = \sum_{i =1}^{N} x_{i}^{*}$ is known and $x^{*}$ is observed for every respondent in the sample. InfoS is information available at the level of the sample, but not at the population level. Vector $x^{o}$ is InfoS if it is known for every member of the sample, both respondents and nonrespondents, but $\sum_{i =1}^{N} x_{i}^{o}$ is unknown.

In a multiple-frame survey, the variables available for InfoU and InfoS may differ across frames. For the NSAF, little auxiliary information was known for nonrespondents in the RDD sample but address-related information (for example, characteristics of the block group) was known for all members of the area-frame sample. The reverse may be true for a dual-frame survey in which Frame 1 is an area frame and Frame 2 is a list frame. The list frame may have rich information that can be used for weighting class adjustments or calibration, while the auxiliary information for the area frame may be restricted to information measured in the survey for which population totals are known from an external source such as a census or population register.

Ranalli et al. (2016) allowed for differing InfoU information across the frames; some of the auxiliary variables may be known for units from all samples and for the full population, while other variables may be of the form $x_{i}^{*} = x_{i} δ_{i}^{(q)}$ with total $X^{*} = \sum_{i =1}^{N} x_{i} δ_{i}^{(q)},$ the total of variable $x$ in Frame $q .$ Calibration to frame counts $N^{(q)}$ is thus a special case of the general calibration theory.

But the differing amounts of information for the frames may also have a bearing on the multiplicity adjustments. Suppose that Frame 2 has rich auxiliary information for calibration while Frame 1 has little information. Calibrating the weights $w_{i}^{(2)}$ before compositing may increase the relative effective sample size from $S_{2}$ and thus increase the value of ${\tilde{n}}^{(2)} / ({\tilde{n}}^{(1)} + {\tilde{n}}^{(2)})$ that would be used for the ESS estimator.

Haziza and Lesage (2016) argued that a two-step weighting procedure offers several advantages for single-frame surveys with nonresponse. The first step divides the design weight for unit $i$ by its estimated response propensity (often calculated from InfoS information) and the second step calibrates the nonresponse-adjusted weights to population control totals (available from InfoU information). When there is substantial nonresponse, weighting adjustment factors from step 1 are often much higher than those from step 2; if the response propensity model is correct, the weighting adjustments in step 2 converge to 1 as $n \to \infty .$ The two-step procedure is thus more robust toward misspecification of the calibration model.

The same considerations apply for multiple-frame surveys. A two-step procedure, where step 1 adjusts the samples separately for nonresponse and step 2 calibrates the combined samples, provides robustness to the calibration model. Suppose that $S_{1}$ has full response; $S_{2}$ has nonresponse but the response propensities can be predicted perfectly from variable $x .$ Then, performing a separate nonresponse adjustment for each sample in step 1 removes the bias for $S_{2}$ so that Assumption (A5) is satisfied. If the data are combined first and then calibrated using (3.4), however, the calibration may change the weights for units in $S_{1}$ in order to meet the calibration constraints, introducing bias for the estimates from $S_{1}$ while not removing it for estimates from $S_{2} .$ More research is needed on the ordering of steps for weight adjustments. It may be better to perform two steps of nonresponse adjustments and calibration on each sample separately, then adjust the weights for multiplicity, and then calibrate to population totals (including re-calibrating on the individual frame variables).

One consequence of using an overlap estimator for a multiple-frame survey is that the multiplicity adjustments may introduce more weight variation, with observations belonging to one frame having much larger weights than observations belonging to more than one frame. If, for example, a list frame (Frame 2 in Figure 2.2(a, b)) is disproportionately oversampled, then the sampling weights for observations in domain $\{1\},$ which are sampled only from Frame 1, may be large relative to the weights for the other domains. Wolter, Ganesh, Copeland, Singleton and Khare (2019) suggested using a shrinkage estimator, estimating $Y_{\{1\}}$ by $κ {\hat{Y}}_{\{1\}}^{(1)} + (1 - κ) N_{\{1\}} ({\hat{Y}}_{\{2\}}^{(2)} + {\hat{Y}}_{\{1, 2\}}) / N^{(2)},$ but the shrinkage may introduce bias; after all, the reason for using a more complicated multiple-frame design instead of just sampling from Frame 2 is to avoid potential bias from omitting domain $\{1\}.$ A better solution, if feasible, is to address the weight variation when designing the survey, as discussed in Section 5.

3.4 Probability sample combined with census of a population subset

Lohr (2014) and Kim and Tam (2021) noted that the situation in Figure 2.2(a) includes the special case in which a probability sample $S_{1}$ is taken from Frame 1 having full coverage, and the sample $S_{2}$ from Frame 2 is a census of domain $\{1, 2\}.$ The overlap domain is thus defined to be the units in $S_{2},$ which may be from administrative records or a convenience sample. Although $S_{2},$ considered by itself, may have undercoverage bias, in the multiple-frame setting the bias is eliminated by the presence of a sample from Frame 1. The units in $S_{2}$ have $w_{i}^{(2)} = 1$ and represent themselves alone; they do not represent any units in other parts of the population. When $N^{(2)} / N$ is small, say from a small convenience sample, $S_{2}$ will have little effect on dual-frame estimators because almost all of the population is in domain $\{1\}.$ But when $N^{(2)} / N$ is large, as may occur when Frame 2 consists of administrative records, the availability of those records may improve the precision of $\hat{Y}$ if Assumptions (A1) to (A6) are met.

When $S_{2}$ is a census with no measurement error, ${\hat{Y}}_{\{1, 2\}}^{(2)} = Y_{\{1, 2\}} .$ Because Frame 1 has full coverage, domain $\{2\}$ is empty and its term drops out of (3.2), so the estimator is

$\hat{Y} (θ) = {\hat{Y}}_{\{1\}}^{(1)} + θ {\hat{Y}}_{\{1, 2\}}^{(1)} + (1 - θ) Y_{\{1, 2\}}; \qquad (3.6)$

taking $θ = 0$ uses the known population total from Frame 2 and relies on Frame 1 only for estimation of the part of the population not in Frame 2.

Kim and Tam (2021) noted that since $Y_{\{1, 2\}}$ is known, it can be used as an InfoU calibration total. They proposed two calibration estimators: a ratio estimator ${\hat{Y}}_{ratio} = {\hat{Y}}^{(1)} Y_{\{1, 2\}} / {\hat{Y}}_{\{1, 2\}}^{(1)}$ and a generalized regression calibration estimator. For many designs, however, the ratio estimator will be less efficient than $\hat{Y} (0)$ from (3.6) because

$V ({\hat{Y}}_{ratio}) \approx V ({\hat{Y}}_{\{1\}}^{(1)}) + {(\frac{Y_{\{1\}}}{Y_{\{1, 2\}}})}^{2} V [{\hat{Y}}_{\{1, 2\}}^{(1)}] - 2 \frac{Y_{\{1\}}}{Y_{\{1, 2\}}} Cov ({\hat{Y}}_{\{1\}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)});$

the ratio adjustment can introduce extra variability from ${\hat{Y}}_{\{1, 2\}}^{(1)}$ that is excluded from $\hat{Y} (0).$
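The comparison can be made explicit with one worked step, using only the approximation above and the fact that $Y_{\{1, 2\}}$ is a known constant in (3.6), so that $V [\hat{Y} (0)] = V ({\hat{Y}}_{\{1\}}^{(1)}).$ Writing $R = Y_{\{1\}} / Y_{\{1, 2\}}$ (a shorthand introduced here for convenience),

$V ({\hat{Y}}_{ratio}) - V [\hat{Y} (0)] \approx R^{2} V ({\hat{Y}}_{\{1, 2\}}^{(1)}) - 2 R \, Cov ({\hat{Y}}_{\{1\}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)}) .$

For $R > 0,$ the ratio estimator is therefore less efficient whenever $Cov ({\hat{Y}}_{\{1\}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)}) < (R / 2) V ({\hat{Y}}_{\{1, 2\}}^{(1)}),$ which includes every design in which the two domain estimators from $S_{1}$ are uncorrelated or negatively correlated.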

Calibrating $\hat{Y} (θ)$ to $Y_{\{1, 2\}} = \sum_{i =1}^{N} x_{i},$ for $x_{i} = δ_{i}^{(2)} y_{i},$ the generalized regression weights in (3.4) become

$c_{i}^{(q)} = {\tilde{w}}_{i}^{(q)} [1 + (Y_{\{1, 2\}} - {\hat{Y}}_{\{1, 2\}} (θ)) {(\sum_{f =1}^{Q} \sum_{k \in S_{f}} {\tilde{w}}_{k}^{(f)} δ_{k}^{(2)} y_{k}^{2})}^{- 1} δ_{i}^{(2)} y_{i}], \qquad (3.7)$

resulting in ${\hat{Y}}_{GR} = \hat{Y} (0)$ from (3.6). Similarly, calibrating on the vector $x_{i} = {(1, δ_{i}^{(2)}, δ_{i}^{(2)} y_{i})}^{T}$ results in ${\hat{Y}}_{GR} = {\hat{Y}}_{\{1\}}^{(1)} N_{\{1\}} / {\hat{N}}_{\{1\}}^{(1)} + Y_{\{1, 2\}} .$
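A small numerical check of the first claim: whatever compositing factor is used before calibration, calibrating on $x_{i} = δ_{i}^{(2)} y_{i}$ with the weights in (3.7) returns the estimator with $θ = 0.$ The sketch assumes a probability sample from complete Frame 1 and an error-free census of Frame 2; the function name and all data values are hypothetical.

```python
import numpy as np

def greg_total_census_case(theta, y1, w1, in_f2, y2):
    """Total from the weights in (3.7) when S2 is a census of Frame 2.

    y1, w1 : values and design weights for the Frame-1 probability sample.
    in_f2  : delta_i^(2) for the S1 units (1 if the unit is also in Frame 2).
    y2     : values for every unit in Frame 2 (census, weight 1).
    """
    y1 = np.asarray(y1, dtype=float)
    w1 = np.asarray(w1, dtype=float)
    in_f2 = np.asarray(in_f2, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    # Adjusted weights: theta on S1 overlap units, 1 - theta on census units.
    w_tilde = np.concatenate([w1 * (1.0 - in_f2 + theta * in_f2),
                              np.full(y2.size, 1.0 - theta)])
    y = np.concatenate([y1, y2])
    x = np.concatenate([in_f2, np.ones(y2.size)]) * y   # x_i = delta_i^(2) y_i
    Y12 = y2.sum()                                       # known overlap total
    c = w_tilde * (1.0 + (Y12 - np.sum(w_tilde * x)) / np.sum(w_tilde * x * y) * x)
    return float(np.sum(c * y))

y1 = [4.0, 7.0, 5.0, 9.0]
w1 = [5.0, 5.0, 5.0, 5.0]
in_f2 = [0, 1, 0, 1]
y2 = [7.0, 9.0, 8.0, 6.0, 10.0, 3.0, 12.0, 5.0, 6.0, 4.0]

# Yhat(0) from (3.6): weighted domain-{1} total from S1 plus the census total.
y_theta0 = 5.0 * (4.0 + 5.0) + sum(y2)
results = []
for theta in (0.0, 0.3, 0.5, 1.0):
    results.append(greg_total_census_case(theta, y1, w1, in_f2, y2))
print(y_theta0, results)
```

Every entry of `results` agrees with `y_theta0`, so in this setting the choice of $θ$ before calibration does not affect the calibrated total.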

For some designs, the variance can be reduced even further. Montanari (1987, 1998) proposed using the regression coefficient $β = {[V (\hat{X})]}^{- 1} Cov (\hat{Y}, \hat{X})$ for calibration, resulting in the estimator

${\hat{Y}}_{opt} = \hat{Y} + {(X - \hat{X})}^{T} β . \qquad (3.8)$

Rao (1994) called (3.8) the optimal regression estimator and showed that $V ({\hat{Y}}_{opt}) \leq V ({\hat{Y}}_{GR}).$ For the dual-frame situation considered in this section, with $x_{i} = δ_{i}^{(2)} y_{i},$

$β = \frac{Cov ({\hat{Y}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)})}{V ({\hat{Y}}_{\{1, 2\}}^{(1)})} = 1 + \frac{Cov ({\hat{Y}}_{\{1\}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)})}{V ({\hat{Y}}_{\{1, 2\}}^{(1)})}$

and

$\begin{array}{rl} {\hat{Y}}_{opt} & = {\hat{Y}}^{(1)} + (Y_{\{1, 2\}} - {\hat{Y}}_{\{1, 2\}}^{(1)}) [1 + \frac{Cov ({\hat{Y}}_{\{1\}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)})}{V ({\hat{Y}}_{\{1, 2\}}^{(1)})}] \\ & = {\hat{Y}}_{\{1\}}^{(1)} + θ_{H} {\hat{Y}}_{\{1, 2\}}^{(1)} + (1 - θ_{H}) Y_{\{1, 2\}}, \qquad (3.9) \end{array}$

where $θ_{H} = - Cov ({\hat{Y}}_{\{1\}}^{(1)}, {\hat{Y}}_{\{1, 2\}}^{(1)}) / V ({\hat{Y}}_{\{1, 2\}}^{(1)})$ is Hartley’s optimal value for $θ$ from (3.3).

Although we usually think of the compositing factor $θ$ as being between 0 and 1, $θ_{H}$ can be outside of this range. For a conceptual example, suppose that Frame 2 is a list of children receiving food assistance at school and the sample from Frame 1 is a cluster sample of households. Then households in which one or more children are receiving food assistance have some household members in domain $\{1, 2\}$ and other members in domain $\{1\}.$ If $y$ exhibits high intra-household correlation, then we would expect ${\hat{Y}}_{\{1\}}^{(1)}$ and ${\hat{Y}}_{\{1, 2\}}^{(1)}$ to be positively correlated. In this case, Hartley’s optimal estimator results in negative weights for units in domain $\{1, 2\}$ from the probability sample.
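To see the negative weights numerically, the sketch below evaluates the census-case value of $θ_{H}$ and the two equivalent forms of the optimal estimator in (3.9). The covariance, variance, and domain totals are hypothetical numbers chosen to mimic a positive within-cluster correlation; the function names are illustrative.

```python
def census_theta_h(cov_f1, v12_f1):
    """theta_H when S2 is a census: -Cov(Yhat_{1}^(1), Yhat_{1,2}^(1)) / V(Yhat_{1,2}^(1))."""
    return -cov_f1 / v12_f1

def optimal_total(y1_hat, y12_hat, Y12, cov_f1, v12_f1):
    """Second line of (3.9): composite of the S1 overlap estimate and the census total."""
    theta = census_theta_h(cov_f1, v12_f1)
    return y1_hat + theta * y12_hat + (1.0 - theta) * Y12

# Hypothetical inputs: positive covariance between the two S1 domain estimators.
y1_hat, y12_hat, Y12 = 45.0, 80.0, 70.0
cov_f1, v12_f1 = 2.0, 8.0

theta_h = census_theta_h(cov_f1, v12_f1)
y_opt = optimal_total(y1_hat, y12_hat, Y12, cov_f1, v12_f1)
# First line of (3.9): regression form with beta = 1 + Cov / V.
y_reg = (y1_hat + y12_hat) + (Y12 - y12_hat) * (1.0 + cov_f1 / v12_f1)
# Adjusted weight of an S1 overlap unit whose design weight is 5.
overlap_weight = 5.0 * theta_h
print(theta_h, y_opt, y_reg, overlap_weight)
```

Both forms give the same total, and the adjusted weight of the overlap unit from the probability sample is negative because the covariance is positive.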

Even though ${\hat{Y}}_{opt}$ is more efficient for special situations such as the cluster sample described above, it depends in practice on an estimate of the covariance, is optimal only for this particular $y$ variable, and may have negative weights. Negative weights can also occur if one does optimal calibration with auxiliary variable $(1, δ_{i}^{(2)}, δ_{i}^{(2)} y_{i});$ in fact, that calibration results in the estimator proposed by Fuller and Burmeister (1972). These optimal regression estimators are sensitive to the model assumptions, and in general I do not recommend their use.

When the Frame-2 sample is a census and Assumptions (A1) to (A6) are met, the precision of population estimates depends entirely on the design of $S_{1} .$ When the samples are not designed to be part of a multiple-frame survey (and sometimes even when they are), it is likely that one or more of the assumptions is violated. Assumptions (A4) and (A6) are particularly suspect when it is desired to combine data from surveys that were not designed with combination in mind. Even if both surveys measure unemployment, they may use different questions so that the unemployment statistics from $S_{2}$ measure a different concept than the statistics from $S_{1} .$ Domain misclassification may also occur. A unit in the census $S_{2}$ is known to also be in complete Frame 1, but it may be difficult to tell whether a unit in $S_{1}$ is also in the administrative records or convenience sample that serves as $S_{2} .$ These problems are discussed in the next section.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-01-06
