Multiple-frame surveys for a multiple-data-source world
Section 2. Classical multiple-frame survey structure and assumptions

First, let’s look at an example of what I shall call a “classical” multiple-frame survey (a survey that is designed to take probability samples from each of a fixed number of frames) and define the notation and assumptions that will be used to describe estimators and their properties.

2.1 National Survey of America’s Families

The goal of the 1997 National Survey of America’s Families (NSAF) was to provide information on social and economic characteristics of the U.S. civilian noninstitutional population under age 65, with emphasis on obtaining reliable estimates for persons and families (particularly families with children) below 200 percent of the poverty threshold. Estimates were desired for the nation as a whole; in addition, separate estimates were desired for 13 states that were purposively selected to vary by geographic region, dominant political party, size, and fiscal capacity.

To meet the precision requirements for estimates, it was desired to have an effective sample size of about 800 poor children in each state. This goal could have been met by taking a household sample from an area frame. Waksberg et al. (1997b) had determined that screening households for income and subsampling nonpoor households would be the most cost-effective way of achieving the desired sample sizes in an area-frame sample, but the cost would be high because only about one in eight families was expected to have children and be under 200 percent of the poverty threshold.

Screening costs would be greatly reduced if the survey could be conducted by telephone using random digit dialing (RDD). But Current Population Survey data indicated that about 20 percent of families living in poverty did not have telephones, so the RDD frame was expected to have substantial undercoverage of the target population. Moreover, households under 200 percent of poverty without telephones might have different income levels or health characteristics than households under 200 percent of poverty with telephones.

Thus, a sample from the area frame would provide high coverage but also come with unacceptably high costs. An RDD survey would have lower costs but would have substantial undercoverage of the population of interest. Waksberg et al. (1997a) used a dual-frame survey, with one sample from the area frame and a second sample chosen independently from the RDD frame, to take advantage of the lower costs of an RDD sample yet also cover nontelephone households. Figure 2.1(a) shows the structure of the two frames.

To further reduce costs, Waksberg et al. (1997a) excluded census block groups with few nontelephone households from the area frame; according to the 1990 census, the excluded areas accounted for less than ten percent of the nontelephone households in each state. With this exclusion, the area and RDD frames each contained households not found in the other frame, as shown in Figure 2.1(b).

Households with telephones that were in the non-excluded block groups were present in both frames. If a probability sample were taken from each frame, households in that overlap (the dark shaded area in Figure 2.1(b)) could be selected in both samples. The survey designers could either conduct the interview with all households in each sample and then deal with the multiplicity in the estimation (an overlap design), or screen out the households in one of the frames that were also in the other frame (a screening design).

Figure 2.1 Frame coverage for the NSAF. The dark shaded area is in both frames

Description for Figure 2.1

Figure illustrating two frame coverage methods for the NSAF. The dark shaded area is in both frames: the random digit dialing (RDD) frame and the area frame. In the first method, (a) Full Area Frame, the dual-frame survey uses a first sample from the area frame and a second sample chosen independently from the RDD frame. In the second method, (b) Restricted Area Frame, census block groups with few nontelephone households are excluded from the area frame.
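The difference between the two options can be seen in a toy example. The sketch below is illustrative only: the four-unit population, the `Unit` class and the values of `y` are invented, and each frame is enumerated completely (every weight equals 1) so that the arithmetic is exact. Adding the two frame totals counts the overlap twice, whereas screening the overlap out of the area frame recovers the population total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    y: float        # variable of interest (hypothetical values)
    in_area: bool   # household is in the area frame
    in_rdd: bool    # household is in the RDD frame


# Hypothetical population mirroring Figure 2.1(b).
population = [
    Unit(y=3.0, in_area=True, in_rdd=False),   # nontelephone household
    Unit(y=5.0, in_area=True, in_rdd=True),    # overlap
    Unit(y=2.0, in_area=True, in_rdd=True),    # overlap
    Unit(y=7.0, in_area=False, in_rdd=True),   # telephone household in an excluded block group
]


def total(units, keep):
    """Sum y over the units for which keep(unit) is true."""
    return sum(u.y for u in units if keep(u))


true_total = total(population, lambda u: True)
area_total = total(population, lambda u: u.in_area)
rdd_total = total(population, lambda u: u.in_rdd)
overlap_total = total(population, lambda u: u.in_area and u.in_rdd)

# Adding the two frame totals counts every overlap unit twice.
naive_sum = area_total + rdd_total

# Screening design: keep only area-frame units that are NOT in the RDD frame.
screened_area_total = total(population, lambda u: u.in_area and not u.in_rdd)
screening_total = screened_area_total + rdd_total

print(true_total, naive_sum, screening_total)
```

With samples rather than complete enumeration the same reasoning applies in expectation: an overlap design must adjust for the multiplicity at the estimation stage, while a screening design removes it at the data-collection stage.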

Waksberg and his colleagues chose to use screening. Households in the area sample were asked if they had a telephone, and only those without telephones were administered the detailed interview. The detailed interview was lengthy and expensive to conduct; screening out the telephone households during a short interview saved resources that could be used to increase the number of nontelephone households in the sample. Households with telephones were sampled only through the RDD frame; households in the RDD sample with no children and above 200 percent of the poverty line were subsampled. Because a screening survey was used, the combined sample from the two surveys was a stratified sample, and resources were allocated to the two samples using stratified sampling formulas that accounted for the higher cost of sampling from the area frame.
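The source does not reproduce the allocation formulas that were used. As a hedged illustration, the sketch below implements the textbook cost-optimal allocation for stratified sampling, in which the stratum sample size is proportional to $N_{h} S_{h} / \sqrt{c_{h}}$ for stratum size $N_{h},$ standard deviation $S_{h}$ and per-unit cost $c_{h}.$ The function name and all numbers are hypothetical and are not NSAF values.

```python
import math


def cost_optimal_allocation(sizes, sds, unit_costs, budget):
    """Textbook allocation minimizing variance for a fixed total cost.

    n_h is proportional to N_h * S_h / sqrt(c_h), scaled so that
    sum(c_h * n_h) equals the budget. Returns real-valued sample sizes.
    """
    scores = [N * S / math.sqrt(c) for N, S, c in zip(sizes, sds, unit_costs)]
    scale = sum(N * S * math.sqrt(c) for N, S, c in zip(sizes, sds, unit_costs))
    return [budget * score / scale for score in scores]


# Hypothetical strata: [area-frame nontelephone households, RDD households].
sizes = [100, 900]        # stratum population sizes (invented)
sds = [1.0, 1.0]          # stratum standard deviations (invented)
unit_costs = [4.0, 1.0]   # an area-frame interview costs 4 times as much (invented)
budget = 110.0

allocation = cost_optimal_allocation(sizes, sds, unit_costs, budget)
sampling_fractions = [n / N for n, N in zip(allocation, sizes)]
print(allocation, sampling_fractions)
```

In this invented example the stratum that costs four times as much per interview is sampled at half the rate of the cheaper stratum, which is the square-root-of-cost trade-off that the formula encodes.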

2.2 Notation and assumptions for multiple-frame surveys

In classical multiple-frame surveys such as the NSAF, a number of assumptions are needed to be able to obtain unbiased estimates of population characteristics along with confidence intervals having approximately correct coverage probabilities.

Suppose there are $Q$ frames. A population domain $d$ is defined by the intersections of the frames: domain $\{1, 3, 4\},$ for example, contains the population units that are in Frames 1, 3, and 4 but not in any of the other frames. Let $D$ denote the set of possible domains; depending on the overlap of units, $D$ can contain between 1 and $2^{Q} - 1$ domains. Figure 2.2 shows three examples of frame relationships. When Frame 1 is complete but Frame 2 is incomplete as in Figure 2.2(a), $D = \{\{1\}, \{1, 2\}\};$ any population unit in Frame 2 is also in Frame 1. For an overlapping dual-frame survey such as that in Figure 2.2(b), $D = \{\{1\}, \{2\}, \{1, 2\}\}.$

Figure 2.2 Three frame structures. (a) Frame 1 has complete coverage and Frame 2 is incomplete. (b) Frames 1 and 2 are both incomplete but overlap. (c) Frame 1 is complete; Frames 2, 3, and 4 are all incomplete but Frames 3 and 4 overlap

Description for Figure 2.2

Figure illustrating three examples of frame structures. On the left, in (a) Frame 1 has complete coverage and Frame 2 is incomplete. Any population unit in Frame 2 is also in Frame 1. In the middle, in (b) Frames 1 and 2 are both incomplete but overlap. On the right, in (c) Frame 1 is complete; Frames 2, 3, and 4 are all incomplete but Frames 3 and 4 overlap.
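The domain structure can be made concrete with a few lines of code. The sketch below is only an illustration: the function names and the unit-level frame memberships are hypothetical. Each unit is represented by the set of frames that contain it, and $D$ is the collection of distinct nonempty membership sets.

```python
from itertools import combinations


def observed_domains(memberships):
    """Return D: the distinct nonempty frame-membership sets among the units."""
    return {frozenset(m) for m in memberships if m}


def all_possible_domains(num_frames):
    """Return every nonempty subset of frames 1..Q (there are 2**Q - 1)."""
    frames = range(1, num_frames + 1)
    return {
        frozenset(subset)
        for size in range(1, num_frames + 1)
        for subset in combinations(frames, size)
    }


# Figure 2.2(a): Frame 1 is complete, Frame 2 is contained in Frame 1.
structure_a = observed_domains([{1}, {1, 2}, {1}, {1, 2}])

# Figure 2.2(b): two incomplete frames that overlap.
structure_b = observed_domains([{1}, {2}, {1, 2}, {2}])

print(sorted(sorted(d) for d in structure_a))
print(sorted(sorted(d) for d in structure_b))
print(len(all_possible_domains(3)))
```

The upper bound $2^{Q} - 1$ is reached only when every combination of frames contains at least one unit; nested or disjoint frames give fewer domains.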

Define $δ_{i} (d) = 1$ if unit $i$ is in domain $d$ and 0 otherwise, and let $δ_{i}^{(q)} = 1$ if unit $i$ is in Frame $q$ and 0 otherwise. Frame $q$ has population size $N^{(q)}$ and domain $d$ has population size $N_{d};$ these sizes may be known or unknown. The target population has a total of $N$ units.
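The two indicators are linked by simple identities, written out here as a sketch in the notation just defined (a restatement of the definitions rather than formulas quoted from elsewhere). A unit is in domain $d$ exactly when it is in every frame listed in $d$ and in no other frame:

$$δ_{i} (d) = \prod_{q \in d} δ_{i}^{(q)} \prod_{q \notin d} [1 - δ_{i}^{(q)}], \qquad δ_{i}^{(q)} = \sum_{d \in D: q \in d} δ_{i} (d).$$

Summing over units gives the corresponding relations among the sizes: $N^{(q)} = \sum_{d \in D: q \in d} N_{d}$ and, when the frames together cover the target population and contain no units outside it, $N = \sum_{d \in D} N_{d}.$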

The following assumptions are typically made in order to draw inferences from classical multiple-frame surveys.

(A1) The union of the $Q$ frames covers the target population.

(A2) The sample $S_{q}$ taken from Frame $q$ is a probability sample where unit $i$ has probability $π_{i}^{(q)}$ of being in $S_{q} .$ Let $w_{i}^{(q)}$ represent the final weight for unit $i$ in $S_{q};$ options for $w_{i}^{(q)}$ include the design weight $1 / π_{i}^{(q)},$ the Hájek weight $N^{(q)} / [{\hat{N}}^{(q)} π_{i}^{(q)}]$ with ${\hat{N}}^{(q)} = \sum_{j \in S_{q}} 1 / π_{j}^{(q)},$ or a nonresponse-adjusted weight.

(A3) The samples $S_{1}, \dots, S_{Q}$ are selected independently.

(A4) The domain membership of each unit $i$ in $S_{q},$ $\{δ_{i} (d), d \in D\},$ is known.

(A5) The estimator of the population total in domain $d$ from $S_{q},$ ${\hat{Y}}_{d}^{(q)} = \sum_{i \in S_{q}} δ_{i} (d) w_{i}^{(q)} y_{i},$ is approximately unbiased for $Y_{d} = \sum_{i =1}^{N} δ_{i} (d) y_{i},$ for all Frames $q$ containing domain $d$ and for all variables $y .$

(A6) There is no measurement error. If unit $i$ is in Frame $q$ and Frame $q^{'},$ $y_{i}$ will have the same value if measured in $S_{q}$ as it will if measured in $S_{q^{'}} .$
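The weights in (A2) and the domain estimator in (A5) translate directly into code. The sketch below is a minimal illustration under stated assumptions: the function names, the dictionary layout of a sampled unit, the inclusion probabilities, and the frame size of 100 are all hypothetical, and nonresponse adjustment is not shown.

```python
def design_weights(inclusion_probs):
    """Design weights 1 / pi_i for the units sampled from one frame."""
    return [1.0 / p for p in inclusion_probs]


def hajek_weights(inclusion_probs, frame_size):
    """Hajek weights N / (N_hat * pi_i), where N_hat is the sum of 1 / pi_j."""
    base = design_weights(inclusion_probs)
    n_hat = sum(base)
    return [frame_size * w / n_hat for w in base]


def domain_total(sample, weights, domain):
    """Estimated total of y in one domain: sum of delta_i(d) * w_i * y_i."""
    return sum(w * unit["y"] for unit, w in zip(sample, weights)
               if unit["domain"] == domain)


# Hypothetical sample from Frame 1; domain membership is known, as in (A4).
sample_1 = [
    {"y": 2.0, "domain": frozenset({1})},
    {"y": 3.0, "domain": frozenset({1, 2})},
    {"y": 1.0, "domain": frozenset({1, 2})},
]
pi_1 = [0.10, 0.10, 0.05]   # inclusion probabilities (invented)
frame_size_1 = 100          # N^(1), assumed known for the Hajek weight

w_design = design_weights(pi_1)
w_hajek = hajek_weights(pi_1, frame_size_1)

overlap = frozenset({1, 2})
print(domain_total(sample_1, w_design, overlap))
print(domain_total(sample_1, w_hajek, overlap))
print(sum(w_hajek))
```

By construction the Hájek weights sum to the frame size $N^{(q)},$ which is why that option requires $N^{(q)}$ to be known; the design weights need only the inclusion probabilities.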

These are strong assumptions; some relaxation of individual assumptions is possible for specific estimators, as discussed in Section 3. But they are weaker than assumptions needed for some of the other possible data integration methods. Record linkage, for example, has an implicit assumption that unit $i$ in Frame $q$ can be matched with a specific unit in Frame $q^{'} .$ For multiple-frame surveys, one must know whether a unit sampled from Frame $q$ is also in other frames, but does not need to identify the matched unit.

2.3 Were the assumptions met in the NSAF?

Survey assumptions are rarely met exactly in practice, and the NSAF was no exception. Assumption (A1) was not met because of the exclusion of block groups with high telephone ownership. The sample from the area frame yielded fewer nontelephone households than expected, perhaps because of measurement error in the 1990 census or population changes since 1990. In addition, post-survey investigations using data from the 1997 Current Population Survey indicated that the block groups excluded from the frame may have had more nontelephone households than anticipated (Waksberg, Brick, Shapiro, Flores-Cervantes, Bell and Ferraro, 1998).

Although independent probability samples were taken from each frame, each sample had nonresponse. The estimated response rates for children were 65 percent in the RDD sample and 84 percent in the area sample. The weighting procedure attempted to address potential bias from undercoverage and nonresponse. The weights of the nontelephone households in the area sample were ratio-adjusted to attempt to compensate for undercoverage from the block group exclusions. Nonresponse-adjusted weights were calculated separately for the area- and RDD-frame samples, and then the combined samples were poststratified to Census Bureau control totals (Brick, Shapiro, Flores-Cervantes, Ferraro and Strickler, 1999). Groves and Wissoker (1999) found little evidence of residual bias in their nonresponse bias analysis; one of the few differences they reported was that households in the RDD sample that required more calls for contact, and households in a subsample taken of nonrespondents, were slightly less likely to be receiving food assistance.

In the NSAF, the domain membership was determined by asking household respondents in the area sample if they had a working telephone. If that question was answered accurately, then Assumption (A4) was met. The investigators attempted to reduce measurement error for Assumption (A6) by having centralized telephone interviewers conduct all of the detailed interviews; households in the area frame were interviewed over a cellular telephone brought by the field representative. Because interviews in domain $\{1, 2\}$ were obtained only from the RDD sample, however, no data are available for evaluating possible measurement error or relative nonresponse bias for the two samples.

Waksberg had used dual-frame surveys several times prior to the NSAF, mostly to increase sample sizes when sampling rare populations, but he recommended using them only when a simpler design would not meet the survey objectives. He wrote: “The price is additional complexity in the sampling operations and the possibility of error if the matching of the two frames is not done carefully.... My instincts are that a more complex scheme should not be used unless there is a reasonably good pay-off” (Waksberg, 1986).

Was the extra complication and expense of the dual-frame design worth the effort in the NSAF? Because telephone households were screened out of the area sample, and because the yield of nontelephone households was less than anticipated, only 1,488 of the total of 44,461 interviewed households came from the area sample. But because of the high poverty rate of the nontelephone households, the estimated percentage of children in households under 200 percent of the poverty threshold was about 3.6 percentage points higher with the full sample than with the RDD sample alone. Even though for many variables there was only a small difference between the full-sample estimate and the RDD-sample estimate, that difference could not have been evaluated without the area sample.
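A simple decomposition shows how a small group can move an estimate this much. As an illustrative sketch (the symbols $W_{nt},$ $p_{nt}$ and $p_{t}$ are introduced here and do not come from the NSAF reports), let $W_{nt}$ be the proportion of children living in nontelephone households, and let $p_{nt}$ and $p_{t}$ be the proportions of children under 200 percent of the poverty threshold in nontelephone and telephone households. The population proportion is then

$$p = W_{nt} p_{nt} + (1 - W_{nt}) p_{t}, \qquad \text{so that} \qquad p - p_{t} = W_{nt} (p_{nt} - p_{t}).$$

If the RDD sample alone is taken to estimate $p_{t}$ (an assumption that ignores the poststratification adjustments), the gap between the two estimates is the product of the nontelephone share and the difference in poverty rates, so even a small $W_{nt}$ produces a noticeable shift when $p_{nt}$ is much larger than $p_{t}.$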

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-01-06
