Replication variance estimation after sample-based calibration
Section 2. Sample-based regression calibration

We consider a survey of a population $U$ with sample $s,$ weights $w_{i},$ and target variables $y_{i} .$ For a given survey estimator $\hat{θ}$ constructed using the weights $w_{i},$ inference is conducted by replication, implemented through the provision of $R$ sets of replicate weights $w_{i}^{(r)}, r = 1, \dots, R,$ and the variance estimation formula

$\hat{V} (\hat{θ}) = A \sum_{r =1}^{R} {({\hat{θ}}^{(r)} - \hat{θ})}^{2}, (2.1)$

where the ${\hat{θ}}^{(r)}$ are computed in the same manner as $\hat{θ}$ but with $w_{i}$ replaced by $w_{i}^{(r)} .$ The constant $A$ depends on the replication method. For simplicity, we focus in what follows on the Horvitz-Thompson estimator of $t_{y} = \sum_{U} y_{i},$ denoted by ${\hat{t}}_{y} = \sum_{s} y_{i} / π_{i} .$ In this case, many replication methods of the form (2.1) lead to a design-consistent estimator of $Var ({\hat{t}}_{y}).$ We will refer to this survey as the “primary survey”.
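As an illustration of formula (2.1), the following minimal sketch computes a replication variance estimate for a Horvitz-Thompson total. It assumes a delete-one jackknife, for which $A = (n - 1) / n$; the text does not prescribe a replication method, and the function names `replicate_variance` and `jackknife_weights` as well as the toy data are hypothetical.

```python
import numpy as np

def replicate_variance(theta_hat, theta_reps, A):
    """Replication variance estimator (2.1): A * sum_r (theta^(r) - theta)^2."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    return A * np.sum((theta_reps - theta_hat) ** 2)

def jackknife_weights(w):
    """Delete-one jackknife replicate weights (an assumed method, not from the text).

    Row r sets the weight of unit r to zero and rescales the others by n / (n - 1).
    """
    w = np.asarray(w, dtype=float)
    n = w.size
    W = np.tile(w * n / (n - 1), (n, 1))
    np.fill_diagonal(W, 0.0)
    return W  # shape (R, n) with R = n

# Toy primary sample with equal weights w_i = 1 / pi_i = 20 (so N = 100, n = 5).
y = np.array([3.0, 5.0, 4.0, 8.0, 6.0])
w = np.full(y.size, 20.0)
W_rep = jackknife_weights(w)

t_hat = np.sum(w * y)          # Horvitz-Thompson estimate of t_y
t_reps = W_rep @ y             # replicate estimates t_y^(r)
A = (y.size - 1) / y.size      # jackknife constant
v_hat = replicate_variance(t_hat, t_reps, A)
```

For this equal-weight toy sample, `v_hat` coincides with $N^{2} s_{y}^{2} / n = 7400,$ the usual simple random sampling variance estimate without a finite population correction.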

We are interested in creating adjusted weights $w_{i}^{*}$ that are calibrated to a set of control totals from a secondary survey of $U$ with sample $s_{C}$ and weights $w_{C i} .$ An estimator from this survey is denoted by ${\hat{θ}}_{C} .$ For the second survey, a replication-based variance estimator is also provided,

${\hat{V}}_{C} ({\hat{θ}}_{C}) = A_{C} \sum_{r =1}^{R_{C}} {({\hat{θ}}_{C}^{(r)} - {\hat{θ}}_{C})}^{2},$

with replicate weights $w_{C i}^{(r)}, r = 1, \dots, R_{C},$ and replication-specific constant $A_{C} .$ The control variables will be denoted by $x_{i},$ with totals estimated from the secondary survey by ${\hat{t}}_{C x} = \sum_{s_{C}} w_{C i} x_{i} .$ Using regression estimation as a framework for calibration, the adjusted estimator is

${\hat{t}}_{y, reg} = {\hat{t}}_{y} + {({\hat{t}}_{C x} - {\hat{t}}_{x})}^{T} \hat{β} = \sum_{s} w_{i}^{*} y_{i} (2.2)$

where ${\hat{t}}_{x} = \sum_{s} w_{i} x_{i}$ is the primary-survey estimator of the control totals and $\hat{β} = {(X_{s}^{T} W_{s} X_{s})}^{- 1} X_{s}^{T} W_{s} Y_{s},$ with $X_{s}$ the matrix with $i^{th}$ row equal to $x_{i}^{T},$ $W_{s}$ the diagonal matrix with $i^{th}$ diagonal entry $w_{i},$ and $Y_{s}$ the vector containing the $y_{i}, i \in s .$ Hence, the calibrated weights can be written as

$w_{i}^{*} = w_{i} (1 + {({\hat{t}}_{C x} - {\hat{t}}_{x})}^{T} {(X_{s}^{T} W_{s} X_{s})}^{- 1} x_{i}) . (2.3)$

We note that post-stratification is a special case of regression estimation; see Särndal, Swensson and Wretman (1992, Chapter 7.6).
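The calibrated weights (2.3) can be sketched numerically as follows. This is a minimal illustration under assumptions: the function name `calibrate`, the simulated data and the control totals `t_cx` are hypothetical and chosen only to show that the weights reproduce the control totals and that both sides of (2.2) agree.

```python
import numpy as np

def calibrate(w, X, t_cx):
    """Regression-calibrated weights (2.3).

    w    : (n,) base weights w_i
    X    : (n, p) matrix with rows x_i^T
    t_cx : (p,) control totals estimated from the secondary survey
    """
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    t_x = X.T @ w                            # primary-survey estimate of the x totals
    M = X.T @ (w[:, None] * X)               # X_s^T W_s X_s
    lam = np.linalg.solve(M, t_cx - t_x)     # M^{-1} (t_Cx - t_x); M is symmetric
    return w * (1.0 + X @ lam)

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(1.0, 5.0, n)])
y = 2.0 + 1.5 * X[:, 1] + rng.normal(0.0, 1.0, n)
w = np.full(n, 40.0)
t_cx = np.array([2100.0, 6200.0])            # hypothetical control totals

w_star = calibrate(w, X, t_cx)

# Both sides of (2.2), computed separately.
beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
t_reg_formula = w @ y + (t_cx - X.T @ w) @ beta_hat
t_reg_weights = w_star @ y
```

The calibrated weights satisfy $\sum_{s} w_{i}^{*} x_{i} = {\hat{t}}_{C x}$ by construction, which is the calibration property that the replicate weights are later required to mimic.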

To obtain a variance estimator, we follow the traditional linearization approach for regression estimators with respect to the sampling design (see e.g., Särndal et al., 1992, Chapter 5.5). Under mild regularity conditions (such as design consistency of Horvitz-Thompson estimators and invertibility of required matrices), the linearized version of the regression estimator (2.2) is equal to the difference estimator,

${\hat{t}}_{y, diff} = {\hat{t}}_{y} + {({\hat{t}}_{C x} - {\hat{t}}_{x})}^{T} β_{N} = {\hat{t}}_{C x}^{T} β_{N} + {\hat{t}}_{e} (2.4)$

where $β_{N} = {(X_{U}^{T} X_{U})}^{- 1} X_{U}^{T} Y_{U}$ is the population target of $\hat{β}$ and ${\hat{t}}_{e} = \sum_{s} w_{i} (y_{i} - x_{i}^{T} β_{N}) .$ The variance of ${\hat{t}}_{y, diff}$ is equal to

$Var ({\hat{t}}_{y, diff}) = Var ({\hat{t}}_{e}) + β_{N}^{T} Var ({\hat{t}}_{C x}) β_{N}, (2.5)$

since the two surveys are independent. This “linearized variance” is the variance of the asymptotic distribution of the regression estimator (2.2). In expression (2.5), the first variance term can be estimated using the replicates from the primary survey and the variance-covariance of the control totals in the second term can be estimated using the replicates from the secondary survey. Hence, the plug-in variance estimator

$\tilde{V} ({\hat{t}}_{y, reg}) = \hat{V} ({\hat{t}}_{\hat{e}}) + {\hat{β}}^{T} {\hat{V}}_{C} ({\hat{t}}_{C x}) \hat{β}, (2.6)$

where ${\hat{t}}_{\hat{e}} = \sum_{s} w_{i} (y_{i} - x_{i}^{T} \hat{β}),$ can be used for asymptotically valid inference for ${\hat{t}}_{y, reg} .$
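A minimal sketch of the plug-in estimator (2.6) is given below. It assumes delete-one jackknife replicates for both surveys and holds $\hat{β}$ fixed when forming the replicate residual totals; both choices, the function names and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def jackknife_weights(w):
    """Delete-one jackknife replicate weights, one row per replicate (assumed method)."""
    n = w.size
    W = np.tile(w * n / (n - 1), (n, 1))
    np.fill_diagonal(W, 0.0)
    return W

def plug_in_variance(y, X, w, W_rep, A, t_cx_hat, t_cx_reps, A_C):
    """Return the two terms of (2.6) and beta_hat."""
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    e_hat = y - X @ beta                      # residuals, with beta_hat held fixed
    t_e = w @ e_hat
    t_e_reps = W_rep @ e_hat                  # replicate residual totals
    v_e = A * np.sum((t_e_reps - t_e) ** 2)   # first term: primary replicates
    D = t_cx_reps - t_cx_hat                  # (R_C, p) control total deviations
    V_C = A_C * (D.T @ D)                     # replicate covariance of t_Cx
    v_c = beta @ V_C @ beta                   # second term: secondary replicates
    return v_e, v_c, beta

rng = np.random.default_rng(1)

# Primary survey (simulated).
n = 30
X = np.column_stack([np.ones(n), rng.uniform(1.0, 5.0, n)])
y = 2.0 + 1.5 * X[:, 1] + rng.normal(0.0, 1.0, n)
w = np.full(n, 50.0)
W_rep, A = jackknife_weights(w), (n - 1) / n

# Secondary survey (simulated), used only for the control totals.
n_c = 40
X_c = np.column_stack([np.ones(n_c), rng.uniform(1.0, 5.0, n_c)])
w_c = np.full(n_c, 37.5)
W_c_rep, A_C = jackknife_weights(w_c), (n_c - 1) / n_c
t_cx_hat = X_c.T @ w_c
t_cx_reps = W_c_rep @ X_c                     # (R_C, p) replicate control totals

v_e, v_c, beta_hat = plug_in_variance(y, X, w, W_rep, A, t_cx_hat, t_cx_reps, A_C)
v_tilde = v_e + v_c                           # plug-in estimate (2.6)
```

Note that this computation needs the data and replicates of both surveys at once, which is the practical drawback discussed next.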

However, it is often not practical to maintain the two datasets and associated sets of replicates for variance estimation purposes. In the context of survey calibration, the organization in charge of creating the adjusted weights for the primary survey would often prefer to continue providing their dataset unchanged except for the new calibrated weights and associated replicate weights, so that data users can perform their analyses using traditional survey tools. Hence, it is of interest to create a single set of replicates for the primary survey that can be used to estimate the variance, while accounting for the fact that the control totals are themselves estimated from a different survey.

We therefore propose to construct new replicates for the primary survey to estimate (2.5). Assume for now that $R_{C} \leq R .$ Starting from the replicate weights $w_{i}^{(r)}$ for the primary survey variance estimator, a replicate variance estimator of the first term in (2.6) is obtained by using the calibrated replicate weights

$w_{1 i}^{* (r)} = w_{i}^{(r)} (1 + {({\hat{t}}_{C x} - {\hat{t}}_{x}^{(r)})}^{T} {(X_{s}^{T} W_{s}^{(r)} X_{s})}^{- 1} x_{i}) . (2.7)$

These replicate weights are obtained by repeating the calibration for each of the replicate weights $w_{i}^{(r)}$ and lead to consistent variance estimation for regression estimators, as discussed for the general case in Fuller (2009, Chapter 4). See also Valliant (1993) for the special case of post-stratification.

The replicate weights $w_{1 i}^{* (r)}$ can be further modified to capture the second term in (2.6) as follows:

$w_{i}^{* (r)} = w_{1 i}^{* (r)} + a_{r} w_{i}^{(r)} {({\hat{t}}_{C x}^{(r)} - {\hat{t}}_{C x})}^{T} {(X_{s}^{T} W_{s}^{(r)} X_{s})}^{- 1} x_{i}, (2.8)$

with the constants $a_{r}$ to be further defined below. Combining (2.7) and (2.8), the resulting replicate weights are

$w_{i}^{* (r)} = w_{i}^{(r)} (1 + {({\hat{t}}_{C x} + a_{r} ({\hat{t}}_{C x}^{(r)} - {\hat{t}}_{C x}) - {\hat{t}}_{x}^{(r)})}^{T} {(X_{s}^{T} W_{s}^{(r)} X_{s})}^{- 1} x_{i}) . (2.9)$

These weights are again obtained by applying the same calibration as for the original weights to each of the replicates, but with replicate control totals ${\hat{t}}_{C x}^{* (r)} = {\hat{t}}_{C x} + a_{r} ({\hat{t}}_{C x}^{(r)} - {\hat{t}}_{C x}).$ The resulting replicate estimates are

${\hat{t}}_{y, reg}^{(r)} = \sum_{s} w_{i}^{* (r)} y_{i} = {\hat{t}}_{y}^{(r)} + {({\hat{t}}_{C x} - {\hat{t}}_{x}^{(r)})}^{T} {\hat{β}}^{(r)} + a_{r} {({\hat{t}}_{C x}^{(r)} - {\hat{t}}_{C x})}^{T} {\hat{β}}^{(r)} .$
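The construction of the replicate weights (2.9) can be sketched as repeated calibration to replicate-specific control totals. In this illustration the replicate weights, the control totals, their replicates and the placeholder constants `a` are all hypothetical inputs, and the function names are not taken from any library.

```python
import numpy as np

def calibrate(w, X, totals):
    """Regression calibration of weights w to the given totals, as in (2.3)."""
    M = X.T @ (w[:, None] * X)
    lam = np.linalg.solve(M, totals - X.T @ w)
    return w * (1.0 + X @ lam)

def calibrated_replicates(W_rep, X, t_cx_hat, t_cx_reps, a):
    """Replicate weights (2.9): calibrate replicate r to t_Cx + a_r (t_Cx^(r) - t_Cx).

    W_rep     : (R, n) primary replicate weights
    t_cx_reps : (R, p) control replicate totals matched to the primary replicates
                (rows with a_r = 0 have no effect)
    a         : (R,) constants a_r
    """
    targets = t_cx_hat + a[:, None] * (t_cx_reps - t_cx_hat)
    W_star = np.empty_like(W_rep, dtype=float)
    for r in range(W_rep.shape[0]):
        W_star[r] = calibrate(W_rep[r], X, targets[r])
    return W_star, targets

rng = np.random.default_rng(2)
n, R, R_C = 40, 8, 5
X = np.column_stack([np.ones(n), rng.uniform(1.0, 5.0, n)])
y = 2.0 + 1.5 * X[:, 1] + rng.normal(0.0, 1.0, n)
w = np.full(n, 25.0)
W_rep = w * rng.uniform(0.5, 1.5, size=(R, n))              # illustrative replicate weights
t_cx_hat = np.array([1050.0, 3100.0])                       # hypothetical control totals
t_cx_reps = t_cx_hat + rng.normal(0.0, 20.0, size=(R, 2))   # hypothetical replicates
a = np.where(np.arange(R) < R_C, 0.8, 0.0)                  # placeholder constants a_r

W_star, targets = calibrated_replicates(W_rep, X, t_cx_hat, t_cx_reps, a)
t_reg_reps = W_star @ y                                     # replicate estimates
```

Each row of `W_star` reproduces its own replicate control totals, and the replicate estimates agree with the three-term expression displayed above.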

The constants $a_{r}$ are chosen to account for the differences between the primary and control replication methods, in particular between $R_{C}$ and $R$ and between $A_{C}$ and $A,$ by letting

$a_{r} = \begin{cases} \sqrt{\frac{A_{C}}{A}}, & r = 1, \dots, R_{C}, \\ 0, & r = R_{C} + 1, \dots, R . \end{cases} \quad (2.10)$

This implies that for $r > R_{C},$ the replicate weights satisfy $w_{i}^{* (r)} = w_{1 i}^{* (r)}$ in (2.8); that is, the unadjusted control totals ${\hat{t}}_{C x}$ are used to calibrate these replicate weights. While the $a_{r}$ are written with the first $R_{C}$ values non-zero, this is for notational convenience only. The assignment of the replicates from the control survey to those of the primary survey should be randomized, to ensure that estimators and replicate estimators from both surveys remain independent regardless of the replication methods.
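One possible way to implement the constants (2.10) together with a randomized assignment is sketched below. The function `assign_control_replicates`, its return convention (an index of $-1$ for unmatched primary replicates) and the numerical values are assumptions for illustration; the text only requires that the matching be random.

```python
import numpy as np

def assign_control_replicates(R, R_C, A, A_C, rng):
    """Return (a, idx) for R primary and R_C <= R control replicates.

    a[r]   : constant a_r, sqrt(A_C / A) for matched replicates and 0 otherwise
    idx[r] : index of the control replicate matched to primary replicate r, or -1
    """
    if R_C > R:
        raise ValueError("this sketch requires R_C <= R")
    a = np.zeros(R)
    idx = np.full(R, -1)
    slots = rng.choice(R, size=R_C, replace=False)   # random primary replicates
    a[slots] = np.sqrt(A_C / A)
    idx[slots] = rng.permutation(R_C)                # random control replicate per slot
    return a, idx

R, R_C, A, A_C = 10, 4, 0.9, 0.75                    # hypothetical values
a, idx = assign_control_replicates(R, R_C, A, A_C, np.random.default_rng(3))

# Scaling check: A * a_r^2 equals A_C on the matched replicates, so the
# control-survey contribution is weighted by A_C inside a sum weighted by A.
c = np.arange(1.0, R_C + 1.0)                        # arbitrary control deviations
lhs = A * np.sum(a[idx >= 0] ** 2 * c[idx[idx >= 0]] ** 2)
rhs = A_C * np.sum(c ** 2)
```

The choice $a_{r} = \sqrt{A_{C} / A}$ is what makes `lhs` and `rhs` agree, whichever primary replicates receive the control replicates.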

Using the replicate weights (2.9) with constants (2.10), the replicate variance estimator (2.1) becomes

$\hat{V} ({\hat{t}}_{y, reg}) = A \sum_{r =1}^{R} {({\hat{t}}_{y, reg}^{(r)} - {\hat{t}}_{y, reg})}^{2} . (2.11)$

Ignoring terms of smaller order as well as those with $a_{r} = 0,$ this is approximately equal to

$\begin{aligned} \hat{V} ({\hat{t}}_{y, reg}) \approx {} & A \sum_{r =1}^{R} {({\hat{t}}_{\hat{e}}^{(r)} - {\hat{t}}_{\hat{e}})}^{2} + {\hat{β}}^{T} \left[ A_{C} \sum_{r =1}^{R_{C}} ({\hat{t}}_{C x}^{(r)} - {\hat{t}}_{C x}) {({\hat{t}}_{C x}^{(r)} - {\hat{t}}_{C x})}^{T} \right] \hat{β} \\ & + 2 A \sum_{r =1}^{R} a_{r} ({\hat{t}}_{\hat{e}}^{(r)} - {\hat{t}}_{\hat{e}}) {({\hat{t}}_{C x}^{(r)} - {\hat{t}}_{C x})}^{T} \hat{β} . \end{aligned}$

The cross-term is likewise of smaller order because of the independence of the two surveys and the fact that $\sum_{r =1}^{R} {\hat{t}}_{\hat{e}}^{(r)} / R \approx {\hat{t}}_{\hat{e}}$ and $\sum_{r =1}^{R_{C}} {\hat{t}}_{C x}^{(r)} / R_{C} \approx {\hat{t}}_{C x} .$ Hence, the replicate variance estimator (2.11) inherits the design consistency of the original replication methods for both surveys and is design-consistent for the linearized variance (2.5).

Finally, we discuss the case when $R_{C} > R .$ The above approach is readily extended to this case by repeating the $R$ replicates of the primary survey $K$ times, where $K$ is the smallest positive integer such that $R_{C} \leq K R .$ The resulting replicate variance estimator is of the same form as (2.1) but with $R$ replaced by $K R$ and $A$ replaced by $A / K .$ Then, the method discussed above applies directly to this new replicate variance estimator for the primary survey. For instance, if $R = 120$ and $R_{C} = 150,$ each replicate in the primary survey will be repeated $K = 2$ times, leading to 240 replicates for the primary survey, of which 150 will be modified.
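The replicate expansion for $R_{C} > R$ can be sketched as follows, using the values $R = 120$ and $R_{C} = 150$ from the example above. The function name `expand_primary_replicates`, the simulated weights and the constant $A = 1 / R$ are hypothetical.

```python
import numpy as np

def expand_primary_replicates(W_rep, A, R_C):
    """Repeat the R primary replicates K times so that R_C <= K * R.

    Returns the stacked (K * R, n) replicate weights, the adjusted constant A / K,
    and K, the smallest positive integer satisfying the inequality.
    """
    R = W_rep.shape[0]
    K = max(1, -(-R_C // R))                  # integer ceiling of R_C / R
    return np.tile(W_rep, (K, 1)), A / K, K

rng = np.random.default_rng(4)
R, R_C, n = 120, 150, 6
w = np.full(n, 10.0)
y = rng.normal(10.0, 2.0, n)
W_rep = w * rng.uniform(0.5, 1.5, size=(R, n))   # illustrative replicate weights
A = 1.0 / R                                      # hypothetical replication constant

W_big, A_big, K = expand_primary_replicates(W_rep, A, R_C)

# The variance estimate (2.1) of the uncalibrated total is unchanged by the expansion.
t_hat = w @ y
v_orig = A * np.sum((W_rep @ y - t_hat) ** 2)
v_big = A_big * np.sum((W_big @ y - t_hat) ** 2)
```

Because each deviation appears $K$ times with constant $A / K,$ the expanded set reproduces the original primary-survey variance estimate exactly before any control replicates are attached.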

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-01-06
