Multiple-frame surveys for a multiple-data-source world
Section 2. Classical multiple-frame survey structure and assumptions
First, let’s look at an
example of what I shall call a “classical” multiple-frame survey
(a survey that is designed to take probability
samples from each of a fixed number of frames)
and define the notation and assumptions that
will be used to describe estimators and their properties.
2.1 National Survey of America’s Families
The goal of the 1997 National Survey of America’s
Families (NSAF) was to provide information on social and economic
characteristics of the U.S. civilian noninstitutional population under age 65,
with emphasis on obtaining reliable estimates for persons and families
(particularly families with children)
below 200 percent of the poverty threshold.
Estimates were desired for the nation as a whole; in addition, separate
estimates were desired for 13 states that were purposively selected to vary by
geographic region, dominant political party, size, and fiscal capacity.
To meet the precision requirements for estimates, it
was desired to have an effective sample size of about 800 poor children in each
state. This goal could have been met by taking a household sample from an area
frame. Waksberg et al.
(1997b) had determined that screening
households for income and subsampling nonpoor households would be the most
cost-effective way of achieving the desired sample sizes in an area-frame
sample, but the cost would be high because only about one in eight families was
expected to have children and be under 200 percent of the poverty threshold.
Screening costs would be greatly reduced if the survey
could be conducted by telephone using random digit dialing (RDD). But Current
Population Survey data indicated that about 20 percent of families living in
poverty did not have telephones, so the RDD frame was expected to have
substantial undercoverage of the target population. Moreover, households under
200 percent of poverty without telephones might have different income levels or
health characteristics than households under 200 percent of poverty with telephones.
Thus, a sample from the area frame would provide high
coverage but also come with unacceptably high costs. An RDD survey would have
lower costs but would have substantial undercoverage of the population of
interest. Waksberg et al.
(1997a) used a dual-frame survey, with
one sample from the area frame and a second sample chosen independently from
the RDD frame, to take advantage of the lower costs of an RDD sample yet also
cover nontelephone households. Figure 2.1(a) shows the structure of the
two frames.
To further reduce costs, Waksberg et al. (1997a) excluded census block groups with few nontelephone
households from the area frame; according to the 1990 census, the excluded
areas accounted for less than ten percent of the nontelephone households in
each state. With this exclusion, the area and RDD frames each contained
households not found in the other frame, as shown in Figure 2.1(b).
Households with telephones that were in the
non-excluded block groups were present in both frames. If a probability sample
were taken from each frame, households in that overlap (the dark shaded area in
Figure 2.1(b)) could be selected in both samples. The survey designers
could either conduct the interview with all households in each sample and then
deal with the multiplicity in the estimation (an overlap design), or screen out
the households in one of the frames that were also in the other frame (a
screening design).

Description for Figure 2.1
Figure illustrating two frame coverage methods for the NSAF. The dark shaded area is in both frames: the random digit dialing (RDD) frame and the area frame. In the first method, (a) Full Area Frame, the dual-frame survey uses a first sample from the area frame and a second sample chosen independently from the RDD frame. In the second method, (b) Restricted Area Frame, census block groups with few nontelephone households are excluded from the area frame.
Waksberg and his
colleagues chose to use screening. Households in the area sample were asked if
they had a telephone, and only those without telephones were administered the
detailed interview. The detailed interview was lengthy and expensive to
conduct; screening out the telephone households during a short interview saved
resources that could be used to increase the number of nontelephone households
in the sample. Households with telephones were sampled only through the RDD
frame; households in the RDD sample with no children and above 200 percent of
the poverty line were subsampled. Because a screening survey was used, the
combined sample from the two surveys was a stratified sample, and resources
were allocated to the two samples using stratified sampling formulas that
accounted for the higher cost of sampling from the area frame.
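As an illustration of how cost enters a stratified allocation, the sketch below applies the textbook cost-optimal rule, in which the sample size in stratum h is proportional to N_h S_h / sqrt(c_h) for a fixed budget. It is not the allocation actually used in the NSAF: the stratum sizes, standard deviations, unit costs, and budget are hypothetical values chosen only to show the direction of the effect.

```python
import math

def optimal_allocation(N, S, c, budget):
    """Cost-optimal stratified allocation for a fixed total budget.

    N, S, c are per-stratum population sizes, standard deviations, and
    per-unit costs. Returns unrounded sample sizes n_h proportional to
    N_h * S_h / sqrt(c_h), scaled so that sum(c_h * n_h) equals the budget.
    """
    k = [Nh * Sh / math.sqrt(ch) for Nh, Sh, ch in zip(N, S, c)]
    scale = budget / sum(ch * kh for ch, kh in zip(c, k))
    return [scale * kh for kh in k]

# Hypothetical strata: telephone households (reached by RDD) and
# nontelephone households (reached through the area sample).
N = [950_000, 50_000]      # assumed stratum sizes
S = [0.49, 0.40]           # assumed standard deviations
c = [50.0, 400.0]          # assumed cost per completed interview
n_rdd, n_area = optimal_allocation(N, S, c, budget=1_000_000)
```

Under these assumed inputs the expensive stratum receives a sampling fraction that is reduced by the square root of its relative cost, which is the sense in which the formulas account for the higher cost of the area frame.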
2.2 Notation and assumptions for multiple-frame surveys
In classical
multiple-frame surveys such as the NSAF, a number of assumptions are needed to
be able to obtain unbiased estimates of population characteristics along with
confidence intervals having approximately correct coverage probabilities.
Suppose there are $Q$ frames. A population domain $d \subseteq \{1, \ldots, Q\}$ is defined by
the intersections of the frames: domain {1, 3, 4}, for example,
contains the population units that are in Frames 1, 3, and 4 but not in
any of the other frames. Let $\mathcal{D}$ denote the set
of possible domains; depending on the overlap of units, $\mathcal{D}$ can contain
between 1 and $2^Q - 1$ domains. Figure 2.2
shows three examples of frame relationships. When Frame 1 is complete but
Frame 2 is incomplete as in Figure 2.2(a), $\mathcal{D} = \{\{1\}, \{1, 2\}\}$: any population
unit in Frame 2 is also in Frame 1. For an overlapping dual-frame
survey such as that in Figure 2.2(b), $\mathcal{D} = \{\{1\}, \{2\}, \{1, 2\}\}$.

Description for Figure 2.2
Figure illustrating three examples of frame structures. On the left, in (a) Frame 1 has complete coverage and Frame 2 is incomplete. Any population unit in Frame 2 is also in Frame 1. In the middle, in (b) Frames 1 and 2 are both incomplete but overlap. On the right, in (c) Frame 1 is complete; Frames 2, 3, and 4 are all incomplete but Frames 3 and 4 overlap.
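The frame structures in Figure 2.2 can be sketched in code by recording, for each population unit, the set of frames that contain it; the domains that occur are then the distinct nonempty sets. The unit lists below are toy data, the function names are hypothetical, and the version of panel (c) assumes that Frame 2 does not overlap Frames 3 or 4, which the figure description suggests but does not state outright.

```python
def max_domains(Q):
    """Largest possible number of domains: nonempty subsets of Q frames."""
    return 2 ** Q - 1

def observed_domains(memberships):
    """Distinct domains that occur, given each unit's set of frames."""
    return {frozenset(m) for m in memberships if m}

# Toy unit-level frame memberships mimicking the three panels.
fig_a = [{1}, {1, 2}, {1}, {1, 2}]                 # Frame 2 nested in Frame 1
fig_b = [{1}, {2}, {1, 2}, {2}]                    # two overlapping frames
fig_c = [{1}, {1, 2}, {1, 3}, {1, 4}, {1, 3, 4}]   # assumed structure for (c)

domains_a = observed_domains(fig_a)
domains_b = observed_domains(fig_b)
domains_c = observed_domains(fig_c)
```

With four frames there could be as many as 15 domains, but the assumed structure for panel (c) produces only 5, which illustrates why the number of domains depends on how the frames overlap.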
Define $\delta_i(d) = 1$ if unit $i$ is in domain $d$ and 0
otherwise, and let $\delta_i^{(q)} = 1$ if unit $i$ is in Frame $q$ and 0
otherwise. Frame $q$ has population size $N^{(q)}$ and domain $d$ has population
size $N_d$; these sizes may
be known or unknown. The target population has a total of $N$ units.
The following assumptions
are typically made in order to draw inferences from classical multiple-frame
surveys.
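(A1) The union of the $Q$ frames covers the target population.
(A2) The sample $S^{(q)}$ taken from Frame $q$ is a probability sample where unit $i$ has probability $\pi_i^{(q)}$ of being in $S^{(q)}$. Let $w_i^{(q)}$ represent the final weight for unit $i$ in $S^{(q)}$; options for $w_i^{(q)}$ include the design weight $1/\pi_i^{(q)}$, the Hájek weight $N^{(q)} / (\pi_i^{(q)} \hat{N}^{(q)})$ with $\hat{N}^{(q)} = \sum_{k \in S^{(q)}} 1/\pi_k^{(q)}$, or a nonresponse-adjusted weight.
(A3) The samples $S^{(1)}, \ldots, S^{(Q)}$ are selected independently.
(A4) The domain membership of each unit $i$ in $S^{(q)}$ is known.
(A5) The estimator of the population total in domain $d$ from $S^{(q)}$, $\hat{Y}_d^{(q)} = \sum_{i \in S^{(q)}} w_i^{(q)} \delta_i(d) y_i$, is approximately unbiased for $Y_d = \sum_{i=1}^{N} \delta_i(d) y_i$ for all Frames $q$ containing domain $d$ and for all variables $y$.
(A6) There is no measurement error. If unit $i$ is in Frame $q$ and Frame $q'$, $y_i$ will have the same value if measured in $S^{(q)}$ as it will if measured in $S^{(q')}$.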
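The weighting options and the domain-total estimator named in these assumptions can be illustrated for a sample drawn from a single frame. This is a minimal sketch under stated assumptions: the inclusion probabilities, the frame size of 50, the y values, and the domain flags are toy values, and the function names are hypothetical. The Hájek weight is taken here to be the design weight rescaled so that the weights sum to the frame size.

```python
def design_weights(pi):
    """Design weights: inverse inclusion probabilities."""
    return [1.0 / p for p in pi]

def hajek_weights(pi, frame_size):
    """Design weights rescaled so that they sum to the known frame size."""
    d = design_weights(pi)
    estimated_size = sum(d)  # estimate of the frame size from the sample
    return [frame_size * di / estimated_size for di in d]

def domain_total(weights, y, in_domain):
    """Weighted total of y over the sampled units that fall in one domain."""
    return sum(w * yi for w, yi, flag in zip(weights, y, in_domain) if flag)

# Toy sample of four units from one frame.
pi = [0.1, 0.1, 0.05, 0.2]               # assumed inclusion probabilities
y = [3.0, 5.0, 2.0, 4.0]                 # assumed survey responses
in_overlap = [True, False, True, True]   # assumed domain membership flags

w = design_weights(pi)
h = hajek_weights(pi, frame_size=50)
y_hat_overlap = domain_total(w, y, in_overlap)
```

Computing a domain total requires knowing each sampled unit's domain membership (the flags above), which is why that knowledge appears as a separate assumption.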
These are strong
assumptions; some relaxation of individual assumptions is possible for specific
estimators, as discussed in Section 3. But they are weaker than
assumptions needed for some of the other possible data integration methods.
Record linkage, for example, has an implicit assumption that unit $i$
in Frame $q$ can be matched
with a specific unit in Frame $q'$. For
multiple-frame surveys, one must know whether a unit sampled from Frame $q$ is also in
other frames, but does not need to identify the matched unit.
2.3 Were the assumptions met in the NSAF?
Survey assumptions are
rarely met exactly in practice, and the NSAF was no exception. Assumption (A1)
was not met because of the exclusion of block groups with high telephone
ownership. The sample from the area frame yielded fewer nontelephone households
than expected, perhaps because of measurement error in the 1990 census or
population changes since 1990. In addition, post-survey investigations using
data from the 1997 Current Population Survey indicated that the block groups
excluded from the frame may have had more nontelephone households than
anticipated (Waksberg, Brick, Shapiro, Flores-Cervantes, Bell and
Ferraro, 1998).
Although independent
probability samples were taken from each frame, each sample had nonresponse.
The estimated response rates for children were 65 percent in the RDD sample and
84 percent in the area sample. The weighting procedure attempted to address potential
bias from undercoverage and nonresponse. The weights of the nontelephone
households in the area sample were ratio-adjusted to attempt to compensate for
undercoverage from the block group exclusions. Nonresponse-adjusted weights
were calculated separately for the area- and RDD-frame samples, and then the
combined samples were poststratified to Census Bureau control totals (Brick,
Shapiro, Flores-Cervantes, Ferraro and Strickler, 1999). Groves and Wissoker (1999) found little evidence of residual bias in their
nonresponse bias analysis; one of the few differences they reported was that
households in the RDD sample that required more calls for contact, and
households in a subsample taken of nonrespondents, were slightly less likely to
be receiving food assistance.
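The final poststratification step can be sketched as a within-cell rescaling of weights so that each cell's weighted count matches its control total. This is a generic illustration rather than the NSAF weighting procedure: the cell labels, starting weights, and control totals are hypothetical.

```python
from collections import defaultdict

def poststratify(weights, cells, control_totals):
    """Scale weights within each cell so they sum to the cell's control total."""
    cell_sums = defaultdict(float)
    for w, c in zip(weights, cells):
        cell_sums[c] += w
    return [w * control_totals[c] / cell_sums[c] for w, c in zip(weights, cells)]

# Toy combined sample with nonresponse-adjusted weights (assumed values).
weights = [100.0, 150.0, 250.0, 200.0, 300.0]
cells = ["child", "child", "child", "adult", "adult"]
controls = {"child": 600.0, "adult": 450.0}   # assumed external control totals

adjusted = poststratify(weights, cells, controls)
```

In this toy example the "child" weights are scaled up by a factor of 1.2 and the "adult" weights are scaled down by 0.9, so undercovered cells gain weight and overrepresented cells lose it.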
In the NSAF, the domain
membership was determined by asking household respondents in the area sample if
they had a working telephone. If that question was answered accurately, then
Assumption (A4) was met. The investigators attempted to reduce measurement
error for Assumption (A6) by having centralized telephone interviewers conduct
all of the detailed interviews; households in the area frame were interviewed
over a cellular telephone brought by the field representative. Because
interviews in the overlap domain (telephone households present in both frames)
were obtained
only from the RDD sample, however, no data are available for evaluating
possible measurement error or relative nonresponse bias for the two samples.
Waksberg had used
dual-frame surveys several times prior to the NSAF, mostly to increase sample
sizes when sampling rare populations, but he recommended using them only when a
simpler design would not meet the survey objectives. He wrote: “The price is
additional complexity in the sampling operations and the possibility of error
if the matching of the two frames is not done carefully.... My instincts are
that a more complex scheme should not be used unless there is a reasonably good
pay-off” (Waksberg, 1986).
Was the extra
complication and expense of the dual-frame design worth the effort in the NSAF?
Because telephone households were screened out of the area sample, and because
the yield of nontelephone households was less than anticipated, only 1,488 of
the total of 44,461 interviewed households came from the area sample. But
because of the high poverty rate of the nontelephone households, the estimated
percentage of children in households under 200 percent of the poverty threshold
was about 3.6 percentage points higher with the full sample than with the RDD
sample alone. Even though for many variables there was only a small difference
between the full-sample estimate and the RDD-sample estimate, that difference
could not have been evaluated without the area sample.
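A small numerical sketch shows how a small area sample can shift an estimate. Because screening makes the telephone and nontelephone samples disjoint, the full-sample proportion is a weighted combination of the two. Only the counts 1,488 and 44,461 come from the text; the weights and outcomes below are hypothetical toy values, not NSAF data.

```python
def weighted_proportion(weights, y):
    """Weighted proportion of units with y = 1."""
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)

# Share of interviewed households that came from the area sample (from the text).
area_share = 1488 / 44461

# Toy data: y = 1 if the child lives under 200 percent of the poverty threshold.
rdd_w = [1000.0] * 9                  # telephone households (RDD sample)
rdd_y = [1, 0, 0, 1, 0, 0, 1, 0, 0]
area_w = [500.0] * 2                  # nontelephone households (area sample)
area_y = [1, 1]

p_rdd = weighted_proportion(rdd_w, rdd_y)
p_full = weighted_proportion(rdd_w + area_w, rdd_y + area_y)
difference = p_full - p_rdd
```

In the toy data the nontelephone households carry a tenth of the total weight but all of them are poor, so adding them raises the estimated proportion from one third to 0.4. The same mechanism, at a different scale, lies behind the difference reported for the NSAF.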
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa