Replication variance estimation after sample-based calibration
Section 4. Application
We return to the calibration problem encountered while
bridging the two 2016 FHWAR surveys. For both surveys, the population is
defined as individuals of ages 16 and older, living in U.S. households. The
main data sources for this application are record-level data files, containing
weights and replicate weights for both surveys. Using these datasets, we
conducted an initial analysis and identified discrepancies in the demographics,
which we adjusted by sample-based raking. Estimated population totals
constructed using the record-level data from the National survey were
considered as random controls, for the crosstabs of census divisions (nine
categories) and each of the following demographic variables:
- residency: two categories corresponding to
urban and rural classification,
- age: eight categories corresponding to age
ranges 16-17; 18-24; 25-34; 35-44; 45-54; 55-64; 65-74, and 75+,
- sex: two categories corresponding to male and
female,
- race-ethnicity: four categories corresponding
to Hispanic, non-Hispanic White, non-Hispanic African American, and
non-Hispanic all other,
- annual income: nine categories corresponding
to income ranges less than $20,000; $20,000-$29,999; $30,000-$39,999;
$40,000-$49,999; $50,000-$74,999; $75,000-$99,999; $100,000-$149,999; $150,000+,
and not reported.
For the application in this article, we use the 50-State
survey public use file, which does not contain information on income.
Therefore, we illustrate the proposed method in a slightly simplified setting
here, using the crosstabs of census divisions and residency, age, sex, and
race-ethnicity as the raking dimensions. We implemented both the Fuller (1998)
method and the proposed calibration method described in Section 2 using
the public-use data files available for both surveys. For comparison, we also
show the results of calibrating without adjusting the variance estimates for
the random controls, referred to below as the “naive” method because it ignores
the variability of the controls in the variance estimates. To compare the
variance estimation methods for survey variables that are not control
variables, we will also show estimates for domains defined by crosstabs of
residency and sex.
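The raking step described above can be sketched as iterative proportional fitting. This is a minimal illustration on toy data, not the authors' implementation: the function name `rake`, the toy weights, and the control totals are all hypothetical, and a single pair of margins (a two-level "division" and sex) stands in for the crosstabs of census division with each demographic variable. In the sample-based setting, the same routine would presumably be rerun for each replicate with that replicate's weights and control totals.

```python
import numpy as np

def rake(weights, groups, controls, max_iter=500, tol=1e-10):
    """Iterative proportional fitting of weights to several sets of control totals.

    groups:   list of integer arrays, one per raking dimension, giving each
              respondent's category in that dimension
    controls: list of arrays of control totals, one per raking dimension
    """
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(max_iter):
        max_change = 0.0
        for g, t in zip(groups, controls):
            # Current weighted total in each category of this dimension.
            current = np.bincount(g, weights=w, minlength=len(t))
            factor = np.asarray(t, dtype=float) / current
            w *= factor[g]
            max_change = max(max_change, np.max(np.abs(factor - 1.0)))
        if max_change < tol:
            break
    return w

# Toy data (hypothetical): six respondents, two raking dimensions.
weights = np.full(6, 10.0)
division = np.array([0, 0, 0, 1, 1, 1])
sex = np.array([0, 1, 0, 1, 0, 1])
division_controls = np.array([40.0, 20.0])
sex_controls = np.array([35.0, 25.0])

raked = rake(weights, [division, sex], [division_controls, sex_controls])
```

After convergence, the weighted totals of `raked` reproduce both sets of control totals, while totals for the intersection of the two dimensions are not constrained.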
While the replication methods of the two FHWAR surveys
are different, they both use 160
replicates. Referring to expression (2.1), the
replication constant for the DAGJK method of the 50-State survey is
and the corresponding constant for the SDR
method of the National survey is
both available from their respective survey
documentation. Hence, the replication adjustment constants
in (2.10) are equal to
for
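To make the role of the replication constants concrete, the sketch below assumes that (2.1) has the usual form of a constant times the sum of squared deviations of the replicate estimates from the full-sample estimate, and that the adjustment in (2.10) rescales each control-survey replicate deviation by the square root of the ratio of the two constants. Both forms are assumptions, since neither expression is reproduced in this section. The constants and replicate totals below are made up and are not the FHWAR values.

```python
import numpy as np

def replication_variance(full_est, rep_ests, c):
    """c times the sum over replicates of squared deviations from the full estimate."""
    rep_ests = np.asarray(rep_ests, dtype=float)
    return c * np.sum((rep_ests - full_est) ** 2, axis=0)

def adjust_replicate_controls(full_controls, rep_controls, c_control, c_target):
    """Rescale replicate deviations of the control totals (assumed form of (2.10)).

    After rescaling, applying the target survey's constant to the adjusted
    replicates reproduces the control survey's own variance estimate.
    """
    a = np.sqrt(c_control / c_target)
    full_controls = np.asarray(full_controls, dtype=float)
    rep_controls = np.asarray(rep_controls, dtype=float)
    return full_controls + a * (rep_controls - full_controls)

# Hypothetical constants and toy replicate control totals for 160 replicates.
c_target, c_control = 0.5, 0.025
rng = np.random.default_rng(0)
full_controls = np.array([100.0, 200.0])
rep_controls = full_controls + rng.normal(scale=3.0, size=(160, 2))

v_control = replication_variance(full_controls, rep_controls, c_control)
adjusted = adjust_replicate_controls(full_controls, rep_controls, c_control, c_target)
v_adjusted = replication_variance(full_controls, adjusted, c_target)
```

Under these assumptions `v_adjusted` equals `v_control`, which is the property that lets the calibrated survey's replicates carry the variability of the control totals.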
The estimates we will consider are all estimated domain
counts, so we define the target variable
for a domain of interest
For the 144 domains defined by the raking
dimensions, we write the estimated domain counts as
Likewise, the control totals are estimated
domain counts from the National survey, so the auxiliary variable vector is
a vector of length 144 containing the
indicators for inclusion of respondent
in the control domains
and let
We denote the vector of control totals as
and the adjusted replicate control totals are
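As an illustration of how a length-144 vector of control-domain indicators arises, the sketch below builds the indicator matrix for the crosstabs of nine census divisions with residency (2 levels), age (8), sex (2), and race-ethnicity (4), giving 9 x 16 = 144 columns. The respondent data are simulated, and the function and variable names are hypothetical.

```python
import numpy as np

N_DIVISIONS = 9
LEVELS = {"residency": 2, "age": 8, "sex": 2, "race_ethnicity": 4}

def control_indicators(division, demo):
    """Return an n x 144 matrix of 0/1 indicators for the control domains.

    division: integer codes 0..8 for each respondent
    demo:     dict mapping each variable name to integer category codes
    """
    division = np.asarray(division)
    n = division.size
    blocks = []
    for name, k in LEVELS.items():
        cell = division * k + np.asarray(demo[name])  # division-by-variable cell
        block = np.zeros((n, N_DIVISIONS * k))
        block[np.arange(n), cell] = 1.0
        blocks.append(block)
    return np.hstack(blocks)

# Simulated respondents (hypothetical data).
rng = np.random.default_rng(0)
n = 50
division = rng.integers(0, N_DIVISIONS, size=n)
demo = {name: rng.integers(0, k, size=n) for name, k in LEVELS.items()}
w = rng.uniform(500.0, 1500.0, size=n)

X = control_indicators(division, demo)
counts = w @ X  # estimated domain counts: weighted sums of the indicators
```

Each respondent falls in exactly one cell per raking dimension, so every row of `X` sums to four and each block of `counts` adds up to the overall weighted total.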
In order to implement the Fuller (1998) method, we
estimated the variance-covariance matrix of the control totals
using the National survey replicate weights.
The spectral decomposition of this matrix resulted in a set of 144 eigenvectors
and associated eigenvalues
for
Following Fuller (1998), we obtain a set of
144 vectors
satisfying
where
for
Finally, the adjusted replicate controls are
for
and
for
This points to a drawback of the Fuller (1998)
method: while our approach perturbs the control totals of all 160 replicates,
this method perturbs only a fraction of them in this case (at most 144, one
for each eigenvector). In addition, 30 of the 144 eigenvalues were nearly zero,
18 of which were less than zero due to floating point error. We truncated the
18 negative eigenvalues to zero and left the rest unchanged. Hence, to the
extent that not all replicates contribute to variance estimates for some survey
estimates (e.g., domain totals), there is a risk that the sample-based
calibration will be imperfectly reflected in the variance estimates. In
general, we expect that a larger number of replicates will be perturbed using
our approach, since the variance-covariance matrix of the control totals can
only be reliably estimated if its dimension is suitably smaller than the
number of replicates.
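The eigenvector-based construction can be sketched as follows. This is one plausible reading of the Fuller (1998) approach, not a transcription of it: each perturbation vector is assumed to be an eigenvector scaled by the square root of its eigenvalue divided by the replication constant, so that the replication variance of the adjusted controls reproduces the estimated covariance matrix. The function name, toy matrix, and constant are hypothetical.

```python
import numpy as np

def fuller_style_controls(x_hat, V, c, R):
    """Adjusted replicate control totals from a spectral decomposition of V.

    The first p replicates are perturbed along the p eigenvectors of V;
    the remaining R - p replicates keep the unperturbed control totals.
    """
    x_hat = np.asarray(x_hat, dtype=float)
    p = x_hat.size
    if R < p:
        raise ValueError("need at least as many replicates as control totals")
    eigval, eigvec = np.linalg.eigh(V)            # V is symmetric
    eigval = np.where(eigval < 0.0, 0.0, eigval)  # truncate negative eigenvalues
    deltas = (eigvec * np.sqrt(eigval / c)).T     # row k: sqrt(lambda_k / c) * q_k
    controls = np.tile(x_hat, (R, 1))
    controls[:p] += deltas
    return controls

# Toy example (hypothetical): a rank-deficient 3 x 3 covariance matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))
V = A @ A.T                     # rank 2, so one eigenvalue is numerically zero
x_hat = np.array([50.0, 30.0, 20.0])
c, R = 0.05, 5

controls = fuller_style_controls(x_hat, V, c, R)
dev = controls - x_hat
V_back = c * dev.T @ dev        # replication covariance of the adjusted controls
```

In this toy case only three of the five replicates are perturbed, and one of those three barely moves because its eigenvalue is numerically zero, which mirrors the drawback discussed above.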
Tables 4.1 and 4.2 show the estimates and standard
errors, respectively, for domains defined by residency and sex, before and
after calibration. The first four rows contain the results for marginal totals
for raking variables, which are exactly calibrated, while the last four are
totals that correspond to the intersection of raking dimensions and are
therefore not exactly calibrated.
Both surveys are representative of the same target
population, but the estimates and associated standard errors differ, reflecting
both sampling variability and the different calibration approaches applied
by the two survey organizations. As Table 4.1 confirms, after the 50-State
survey is raked to the National survey, the estimated totals for the exact
calibration domains match exactly between the two surveys.
For the domains defined by the crosstabulation of residency and sex, the raked
estimates for the 50-State survey are close but not identical to those of the
National survey.
Table 4.1
Population estimates before and after calibration, rounded to the nearest integer, after scaling by
| Domain | Before calibration: 50-State | Before calibration: National | After calibration |
|---|---|---|---|
| Residency: Urban | 203,445 | 208,695 | 208,695 |
| Residency: Rural | 51,511 | 45,991 | 45,991 |
| Sex: Male | 128,276 | 121,775 | 121,775 |
| Sex: Female | 126,680 | 132,911 | 132,911 |
| Urban: Male | 99,547 | 98,511 | 98,089 |
| Urban: Female | 103,898 | 110,184 | 110,607 |
| Rural: Male | 28,729 | 23,264 | 23,686 |
| Rural: Female | 22,782 | 22,727 | 22,305 |
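A quick arithmetic check on the Table 4.1 figures shows how the crosstab cells relate to the margins. The cells 99,547 and 103,898 are assigned to Urban here because only that assignment reproduces the Urban margin of 203,445; the cells 28,729 and 22,782 likewise reproduce the Rural margin of 51,511. The dictionary layout and names are illustrative only.

```python
import numpy as np

# Columns: 50-State before, National before, 50-State after calibration.
margins = {
    "urban": np.array([203_445, 208_695, 208_695]),
    "rural": np.array([51_511, 45_991, 45_991]),
    "male": np.array([128_276, 121_775, 121_775]),
    "female": np.array([126_680, 132_911, 132_911]),
}
cells = {
    ("urban", "male"): np.array([99_547, 98_511, 98_089]),
    ("urban", "female"): np.array([103_898, 110_184, 110_607]),
    ("rural", "male"): np.array([28_729, 23_264, 23_686]),
    ("rural", "female"): np.array([22_782, 22_727, 22_305]),
}

def margin_gap(level, axis):
    """Sum of the cells sharing `level` on the given axis, minus the margin."""
    total = sum(v for key, v in cells.items() if key[axis] == level)
    return total - margins[level]

gaps = {lvl: margin_gap(lvl, 0) for lvl in ("urban", "rural")}
gaps.update({lvl: margin_gap(lvl, 1) for lvl in ("male", "female")})
```

Every gap is at most 1 in absolute value, which is consistent with independent rounding of the table entries. The raked cell estimates differ from the National cell estimates even though both sets add up to the same margins.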
Table 4.2 shows the standard errors obtained by the
two replication methods with adjusted control totals and by the naive method,
which does not account for the randomness of the control totals. By
construction, the proposed replication-based adjustment method and the Fuller
(1998) method lead to identical variance estimates for domains that are used in
the calibration. These variance estimates are also equal to those from the
control survey in this case. This reflects the fact that the variance component
corresponding to the first term in (2.5) is set to 0 for the control totals,
while the variance component for the second term is exactly equal to the
control survey variance estimate in the case of raking. Because the naive
method ignores that second component, its variance estimates for these domains
are equal to zero. For the estimated totals for domains defined as the
crosstabulation of residency and sex, the standard errors of the two methods
are not identical but close (within about 8% of each other), reflecting the
fact that both are consistent for the asymptotic variance (2.5). The standard
errors under the naive method are smaller than those under the other two
calibration methods, an obviously incorrect result caused by not accounting for
the variance in the random control totals. For other variables, the variance is
still expected to be underestimated under the naive method, because the second
term in the asymptotic variance (2.5) is ignored.
Table 4.2
Standard errors of population estimates before and after calibration, rounded to the nearest integer, after scaling by
| Domain | Before calibration: 50-State | Before calibration: National | After calibration: Naive | After calibration: Fuller | After calibration: Proposed |
|---|---|---|---|---|---|
| Residency: Urban | 1,922 | 2,664 | 0 | 2,664 | 2,664 |
| Residency: Rural | 1,922 | 2,598 | 0 | 2,598 | 2,598 |
| Sex: Male | 2,117 | 1,074 | 0 | 1,074 | 1,074 |
| Sex: Female | 2,117 | 1,112 | 0 | 1,112 | 1,112 |
| Urban: Male | 2,118 | 1,399 | 853 | 1,658 | 1,533 |
| Urban: Female | 2,514 | 1,797 | 853 | 1,964 | 1,970 |
| Rural: Male | 1,595 | 1,449 | 853 | 1,709 | 1,641 |
| Rural: Female | 979 | 1,271 | 853 | 1,470 | 1,547 |
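The comparison of the three after-calibration methods can be checked directly from the four residency-by-sex rows of Table 4.2. The sketch below uses the tabulated standard errors as given. The symmetric relative difference (absolute difference divided by the average of the two values) is one choice of measure and is an assumption here, since the text does not say how the closeness of the two methods was computed.

```python
import numpy as np

# After-calibration standard errors for the four residency-by-sex cells,
# in the row order of Table 4.2.
naive = np.array([853, 853, 853, 853], dtype=float)
fuller = np.array([1658, 1964, 1709, 1470], dtype=float)
proposed = np.array([1533, 1970, 1641, 1547], dtype=float)

# Symmetric relative difference between the Fuller and proposed standard errors.
rel_diff = np.abs(fuller - proposed) / ((fuller + proposed) / 2)

# Naive standard error as a fraction of the proposed standard error.
naive_ratio = naive / proposed
```

On these figures the largest relative difference between the Fuller and proposed standard errors is roughly 8%, while the naive standard errors are only about half the size of the proposed ones.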
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa