A method to correct for frame membership error in dual frame estimators
Section 6. Example: Angler survey
We illustrate the use of the
proposed bias correction procedure with its application to a dual frame mail
survey of anglers in North Carolina (NC) in 2009. It was a pilot survey testing
several changes to an ongoing program collecting recreational marine angler
effort by NOAA, where effort is defined as the number of fishing trips during a
specified time period. The two frames were a NC address frame and a license
frame, which included the names and addresses of anglers who had any of several
types of licenses. The target population of the pilot survey was recreational
anglers who fished in NC saltwater, regardless of where they lived. The target
time period of fishing was Wave 6 of 2009 (November - December). These two
frames together had some undercoverage because unlicensed anglers whose home
address was outside NC were not included in the union of the two frames.
6.1 Sample design
The address frame was obtained
from the US Postal Service and covered all households in NC. The license frame
included all persons listed on the NC database of licensed anglers as of the
date of the license pull, which was several days before the mailing of the
surveys. Independent samples of addresses were drawn from the two frames.
Estimates were made of the fishing effort in NC during Wave 6, 2009 using the
Hartley estimator with
The sample from the address frame was a
complex sample, and itself involved two phases. The sample from the license
frame required only one phase. In this application, units were at risk of
misclassification only if they were chosen from the larger of the two frames,
the address frame, since it was not known whether the persons in those
households owned fishing licenses. The analysts did know that all persons
selected from the license frame had a valid license during the wave and also
knew whether or not they had an address in North Carolina, since their
address was available from the frame.
The sample design for the
address frame was conducted in two phases. A random sample of 1,800 addresses
was selected first, stratified by geography. The strata were defined as
addresses in coastal and non-coastal counties of NC, with samples of 900 each.
A screening questionnaire asked whether any household member fished in
saltwater in the last 12 months. The second phase sample consisted of one
randomly chosen angler from every household that reported fishing by any
household member in the first phase. One additional angler was selected from
households reporting more than one active adult angler. The reason for this
two-phase construction was to avoid sending a lengthy questionnaire to
non-angling households, in order to decrease cost and increase response rate.
The license frame was obtained
from the NC license database. All individuals who were listed on the database
on the day the frame was pulled and were licensed to fish during the target
period (Wave 6, 2009) were included. The license frame was preprocessed to make
it suitable for sampling. Multiple records with the same core data (name, date
of birth and address) were deleted, as were anglers identified as being under
18. The license frame was divided into three strata: coastal, non-coastal, and
out-of-state. The file was sorted by address, and a systematic sample of 450
anglers was selected from each stratum. Sampling in the license frame was
conducted in a single phase, and used a questionnaire identical to the second
phase questionnaire for the address frame sample. As in the address frame, a
supplemental sample was selected from addresses with more than one licensed
angler present on the frame.
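The systematic selection within each license frame stratum can be sketched as follows. This is a minimal illustration, assuming a random start and a fixed (possibly fractional) sampling interval applied to a file already sorted by address; the function name `systematic_sample`, the seed, and the toy frame are hypothetical, and the survey's actual selection routine may have differed.

```python
import random

def systematic_sample(sorted_frame, n, rng):
    """Draw n units from an already sorted list with a random start and a
    fixed (possibly fractional) sampling interval."""
    N = len(sorted_frame)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= len(sorted_frame)")
    step = N / n                    # sampling interval
    start = rng.random() * step     # random start in [0, step)
    # min() guards against a floating-point edge case at the end of the list
    return [sorted_frame[min(int(start + k * step), N - 1)] for k in range(n)]

# Hypothetical toy frame of (stratum, address key) records; not survey data.
strata = ("coastal", "non-coastal", "out-of-state")
frame = [(s, f"addr{i:04d}") for s in strata for i in range(1000)]

rng = random.Random(2009)           # arbitrary seed, for reproducibility only
samples = {}
for stratum in strata:
    units = sorted(u for u in frame if u[0] == stratum)  # sort by address
    samples[stratum] = systematic_sample(units, 450, rng)

print({s: len(v) for s, v in samples.items()})
```

Sorting by address before selection spreads the sample across the address ordering within each stratum, which is the usual motivation for a systematic design.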
The common questionnaire used
for both frames included an item that asked whether the respondent had a NC
marine recreational fishing license. This question was included to determine
domain membership for those chosen from the address frame. However, analysts
observed that some respondents from the license frame reported they did not
have a license, which alerted them to the possible presence of domain
misclassification error. As a result, an operation was undertaken to determine
true domain membership for respondents from the address frame. We attempted to
match 100% of the sampled addresses to the license frame. The last part of this
process involved a human matcher trying to identify if a particular angler
within a matched address appeared to be the licensed angler, based on available
data from the license frame and survey responses. This was a time-consuming
operation, which motivated the search for alternatives. The goal was to develop
methods for the operational survey that allowed for determination of true
domain status for only a subset of the sample. However, since we did have
access to the true domain status for the entire sample, we were able to examine
misclassification probabilities and subdomain means, as well as to compare
results with an estimate made from “true”
data.
Even though we observed that
some on the license frame made errors concerning their license status, this did
not cause a domain misclassification error for the license frame because the
true license status was known. For the license frame, domain misclassification could
occur only if the in or out-of-state status of the household could not be
determined accurately. It is possible such errors could occur. For example, if
a household with an out-of-state address on the license frame were sampled, but
it had a second in-state address that appeared on the address frame, then the
domain assignment would be incorrect. However, the incidence of such cases was
believed to be small enough that it could be ignored, so we treated the
misclassification probabilities as if they were known to be 0 for the units on
the license frame in our analysis.
6.2 Sample analysis
The
domain misclassification rates for the sample from the address frame are shown
separately by stratum in Table 6.1. In this case domain
contains those respondents from the address frame who report
that they are licensed, while domain
contains those reporting they are not licensed. Anglers who
reported that they are unlicensed have about a 5% error rate in both strata.
Those who reported they are licensed have extremely high error rates, with
those in non-coastal counties more likely to be wrong than right! We point out
that the address frame respondents from whom these estimates were computed are
those who were in the second phase of the address frame sample. This means that
they had screened in because their household had at least one person who had
fished in the last 12 months. As a result, a very high fraction of these
respondents were anglers compared to the general population.
Table 6.1
Misclassification rates calculated from full sample (Address frame, Wave 6, 2009)

| Stratum | Proportion of those who report not being licensed who are | Proportion of those who report being licensed who are not |
|---|---|---|
| Coastal | 0.04 | 0.46 |
| Non-coastal | 0.06 | 0.63 |
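Rates of this kind can be computed from matched records as the share of respondents in each stratum and reported-status group whose true status disagrees with their report. The sketch below is a minimal, unweighted illustration under that assumption; the function name `misclassification_rates` and the toy records are hypothetical and are not the survey data.

```python
from collections import defaultdict

def misclassification_rates(records):
    """records: iterable of (stratum, reported_licensed, truly_licensed).
    Returns {(stratum, reported_licensed): share whose true status differs}."""
    counts = defaultdict(lambda: [0, 0])      # key -> [n_reported, n_wrong]
    for stratum, reported, true in records:
        key = (stratum, reported)
        counts[key][0] += 1
        counts[key][1] += int(reported != true)
    return {key: wrong / n for key, (n, wrong) in counts.items()}

# Hypothetical toy records (stratum, reported licensed?, truly licensed?).
records = [
    ("coastal", False, False), ("coastal", False, False),
    ("coastal", False, False), ("coastal", False, True),
    ("coastal", True, True), ("coastal", True, False),
    ("non-coastal", False, False), ("non-coastal", False, False),
    ("non-coastal", True, False), ("non-coastal", True, False),
    ("non-coastal", True, True),
]

rates = misclassification_rates(records)
for key in sorted(rates):
    print(key, round(rates[key], 2))
```

A design-weighted version would replace the unit counts with sums of survey weights.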
We
also examined the equal means assumptions using data from the address frame
sample. The estimated mean effort in each of the four categories of domain and
perceived domain membership are shown in Table 6.2. The columns classify
respondents into perceived domains, while the rows classify according to their
true domain. The table shows that respondents’ fishing behavior is consistent
with what they report their license status to be rather than what their true
status is. Thus, we believe that the equal means assumption of our proposed
method is more reasonable for the angler survey data than Lohr’s equal mean
assumption.
Table 6.2
Estimated mean # of fishing trips (SE) by subdomain for Wave 6 2009 NC Address frame

| Subdomain | reported no license | reported license |
|---|---|---|
| true no license | 0.34 (0.14) | 0.88 (0.41) |
| true license | 0.35 (0.46) | 0.98 (0.24) |
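The comparison behind this reading of Table 6.2 can be made explicit by contrasting the subdomain means along each margin. The sketch below uses the published means and standard errors; treating the four estimates as independent when forming an approximate z statistic is an assumption made here for illustration, since the covariances are not reported.

```python
import math

# (mean, SE) from Table 6.2, keyed by (true status, reported status)
means = {
    ("no license", "no license"): (0.34, 0.14),
    ("no license", "license"):    (0.88, 0.41),
    ("license",    "no license"): (0.35, 0.46),
    ("license",    "license"):    (0.98, 0.24),
}

def contrast(cell_1, cell_2):
    """Difference of two subdomain means and an approximate z statistic,
    treating the two estimates as independent (an assumption)."""
    (m1, s1), (m2, s2) = means[cell_1], means[cell_2]
    diff = m2 - m1
    return diff, diff / math.hypot(s1, s2)

statuses = ("no license", "license")
# Same reported status, different true status (within a column).
by_true = {rep: contrast(("no license", rep), ("license", rep)) for rep in statuses}
# Same true status, different reported status (within a row).
by_reported = {tru: contrast((tru, "no license"), (tru, "license")) for tru in statuses}

for label, table in (("by true status", by_true), ("by reported status", by_reported)):
    for key, (diff, z) in table.items():
        print(f"{label:>18} | held fixed: {key:<10} | diff={diff:.2f} z={z:.2f}")
```

On the point estimates, changing the true status within a reported-status column moves the mean by 0.01 and 0.10 trips, whereas changing the reported status within a true-status row moves it by 0.54 and 0.63 trips. Under the independence assumption none of the four contrasts exceeds about 1.3 standard errors, so the pattern is suggestive rather than conclusive.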
The
sample data contained weights provided by the survey designers that accounted
for the complex design and nonresponse adjustment. Because the domain
misclassification probabilities differed by stratum, we adjusted the weights as
described in Section 5 separately by stratum, using individual estimates
of misclassification for each address frame domain. We assumed no domain
misclassification for the license frame. Five estimates of effort were computed
and are shown in Table 6.3:
- Uncorrected Hartley estimator (labeled Unadj. in the table): The perceived
  domain membership was used to estimate the total, using the Hartley estimator
  as in (3.3);
- 20%-, 40%- and 100%-subsampled estimators: Units from each stratum of the
  phase 1 address frame sample were subsampled, and their true domains were used
  to estimate the misclassification probabilities. The weight adjustments were
  calculated based on the estimated misclassification probabilities;
- Corrected Hartley estimator (True): The true domain membership ascertained
  from the matching operation was used to estimate the total number using the
  Hartley estimator with the original weights, as in (3.1). This is considered
  the best available estimate since it requires no assumptions for unbiasedness.
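The difference between the uncorrected and corrected Hartley estimators can be illustrated with a small sketch. It assumes the standard Hartley form, in which overlap-domain units are weighted by theta when sampled from frame A and by 1 - theta when sampled from frame B; it does not reproduce the paper's equations (3.1) and (3.3) or the Section 5 weight adjustment. The labels "A" (address frame), "B" (license frame), "a", "ab", "b", the value of `theta`, and the toy units are all hypothetical.

```python
def hartley_total(units, theta):
    """Standard Hartley dual frame total.
    units: iterable of (frame, domain, weight, y), with frame in {"A", "B"}
    and domain in {"a", "ab", "b"}; theta is the overlap compositing factor."""
    total = 0.0
    for frame, domain, weight, y in units:
        if domain == "ab":                       # overlap: split between frames
            factor = theta if frame == "A" else 1.0 - theta
        else:                                    # single-frame domains
            factor = 1.0
        total += factor * weight * y
    return total

# Hypothetical toy sample: (frame, perceived domain, true domain, weight, trips).
sample = [
    ("A", "a",  "a",  100.0, 0),
    ("A", "a",  "ab", 100.0, 1),   # licensed, but reports no license
    ("A", "ab", "a",  100.0, 2),   # reports a license, but has none
    ("A", "ab", "ab", 100.0, 3),
    ("B", "ab", "ab",  50.0, 2),   # license frame: domain treated as known
    ("B", "b",  "b",   50.0, 4),
]

theta = 0.5  # illustrative value only
perceived = hartley_total([(f, p, w, y) for f, p, t, w, y in sample], theta)
true = hartley_total([(f, t, w, y) for f, p, t, w, y in sample], theta)
print(perceived, true)
```

In this toy sample only address frame units change domain between the two calls, mirroring the application, where misclassification was assumed absent on the license frame.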
The first row of Table 6.3 contains the five estimates, the second row contains an estimate of
bias for each, and the third row shows the square root of the sum of estimated
variance and squared bias. The bias displayed in row 2 is the difference
between each estimate and the corrected Hartley estimate (True column). We acknowledge that the address matching algorithm is
undoubtedly not perfect, which means that the “True” estimator may still
contain bias in addition to its sampling variability. Still, taking this as our
best assessment of bias, we see that after applying the bias correction method,
the estimated bias is reduced by using the bias-corrected estimator from
to
The difference between
and
with 100% subsampling may reflect failure of
the required equal mean assumptions. The estimated RMSE is reduced by using a
bias-adjusted method instead of the unadjusted Hartley estimator by about
Table 6.3
Estimated total fishing trips (Address frame, Wave 6, 2009)

| | Unadj. | 20% sub. | 40% sub. | 100% sub. | True |
|---|---|---|---|---|---|
| Estimate | 731,430 | 889,860 | 863,488 | 905,947 | 942,360 |
| Bias | 210,930 | 52,500 | 78,872 | 36,413 | 0 |
| RMSE | 244,531 | 181,809 | 180,311 | 176,954 | 213,966 |
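The relationships among the rows of Table 6.3 can be checked directly from the published figures. The sketch below recomputes the Bias row as the gap between the True estimate and each estimate, backs out the standard error implied by RMSE = sqrt(variance + bias squared), and expresses each RMSE relative to the unadjusted one. The variable names are illustrative, and the implied standard errors are derived here rather than reported in the source.

```python
import math

# Published values from Table 6.3
estimates = {"Unadj.": 731_430, "20% sub.": 889_860, "40% sub.": 863_488,
             "100% sub.": 905_947, "True": 942_360}
rmse = {"Unadj.": 244_531, "20% sub.": 181_809, "40% sub.": 180_311,
        "100% sub.": 176_954, "True": 213_966}

# Bias as tabulated: gap between the True estimate and each estimate.
bias = {name: estimates["True"] - value for name, value in estimates.items()}

# Standard error implied by RMSE^2 = variance + bias^2.
implied_se = {name: math.sqrt(rmse[name] ** 2 - bias[name] ** 2) for name in rmse}

# RMSE change relative to the unadjusted Hartley estimator.
rel_change = {name: 1 - rmse[name] / rmse["Unadj."] for name in rmse}

for name in estimates:
    print(f"{name:>10}: bias={bias[name]:>8,} "
          f"implied SE={implied_se[name]:>10,.0f} "
          f"RMSE change vs Unadj.={rel_change[name]:+.1%}")
```

The recomputed gaps match the tabulated Bias row exactly, which confirms that the row reports the True estimate minus each estimate.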
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2019
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa