6. Simulation studies
Alina Matei and M. Giovanna Ranalli
We evaluate the performance of the estimator presented in Section 5 by means of a Monte Carlo simulation under two different settings. The first uses a real data set as the population and considers variables of interest that are all binary, while the second uses simulated population data with continuous variables of interest. Results from the first setting are presented in Section 6.1, and those from the second in Section 6.2.
In both settings, simple random sampling without replacement is employed and the following estimators are considered:
- the Horvitz-Thompson estimator in the case of full response, computed as a benchmark in the absence of nonresponse;
- the naive estimator given in (5.1), in which no explicit action is taken to adjust for unit and item nonresponse. Under simple random sampling without replacement, it coincides with the Horvitz-Thompson estimator adjusted for unit nonresponse that assumes uniform response probabilities;
- the three-phase estimator proposed in Section 5, Equation (5.2);
- the three-phase estimator that uses the true values of the response probabilities, computed for comparison with the previous estimator to understand the effect of estimating the response probabilities.
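As a minimal sketch of the full-response benchmark, the snippet below draws a simple random sample without replacement and computes the Horvitz-Thompson estimate of a total, using the inclusion probability n/N that applies under this design. The function name `srswor_ht_total` and the toy population are illustrative assumptions, not taken from the study.

```python
import random

def srswor_ht_total(population, n, rng):
    """Draw a simple random sample without replacement (SRSWOR) of size n
    and return (sample_indices, Horvitz-Thompson estimate of the total)."""
    N = len(population)
    sample = rng.sample(range(N), n)   # each unit has inclusion probability n/N
    pi = n / N
    ht_total = sum(population[k] / pi for k in sample)
    return sample, ht_total

# Hypothetical binary population, for illustration only
y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
rng = random.Random(2015)
sample, estimate = srswor_ht_total(y, n=5, rng=rng)
```

With a census (n equal to the population size) the estimate equals the true total, which is a quick sanity check on the weights.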
The simulations are carried out in R version 2.15, using the R package ‘ltm’ (Rizopoulos 2006) to fit the latent trait models. The following performance measures are computed for each estimator over the simulation runs:
- the Monte Carlo bias (B), the mean of the estimates over all simulation runs minus the population total;
- the relative bias (RB), the Monte Carlo bias relative to the population total;
- the Monte Carlo standard deviation (SD) of the estimates over the simulation runs;
- the Monte Carlo mean squared error (MSE) of the estimates with respect to the population total.
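The four measures listed above can be sketched in code. The function below assumes the usual textbook definitions, with the number of runs as the divisor in both the standard deviation and the mean squared error, so that MSE equals the squared bias plus the squared SD. The name `mc_performance` and the toy estimates are hypothetical.

```python
import math

def mc_performance(estimates, total):
    """Monte Carlo performance measures for a list of estimates of `total`.
    Assumed definitions (divisor M throughout), not necessarily the paper's."""
    M = len(estimates)
    mean_est = sum(estimates) / M
    bias = mean_est - total                                   # Monte Carlo bias
    rel_bias_pct = 100.0 * bias / total                       # relative bias, in %
    sd = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / M)
    mse = sum((e - total) ** 2 for e in estimates) / M        # mean squared error
    return {"B": bias, "RB_pct": rel_bias_pct, "SD": sd, "MSE": mse}

# Hypothetical estimates from four simulation runs, true total 10
res = mc_performance([9, 11, 10, 14], total=10)
```

Under these definitions `res["MSE"]` equals `res["B"]**2 + res["SD"]**2` up to rounding.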
6.1 Simulation setting 1
We consider the Abortion data set, formed by four binary variables extracted from the 1986 British Social Attitudes Survey and concerning attitudes towards abortion. The data are available in the R package ‘ltm’ (Rizopoulos 2006). The individuals answered the following questions after being asked whether the law should allow abortion under the circumstances presented under each item:
- The woman decides on her own that she does not wish to keep the baby.
- The couple agrees that they do not wish to have a child.
- The woman is not married and does not wish to marry the man.
- The couple cannot afford any more children.
The variable of interest
is selected to
be the second one
with a total
in the
population.
The
data is analyzed by Bartholomew et al. (2002) as an example in which a
latent variable can be found that measures the attitude towards abortion. At
the population level, we compute the latent variable (denoted here by
using Model
(4.2) on the
data. The
correlation between the values
and
is
approximately equal to 0.85, for
Afterwards, we
have set
for all
At
the population level, the unit response probabilities are generated using the
following response model
with
to simulate
nonignorable nonresponse. The population mean of
is approximately
0.74.
To
generate item response probabilities at the population level, the following model
is used
where
for
while
takes different
values according to
in particular,
and
The nominal item nonresponse rates for the four items in the population are 35%, 42%, 47% and 31%, respectively.
We
draw
simple random
samples without replacement from the population using two sample sizes:
and
In each sample
the units are
classified as respondents according to Poisson sampling, using the
probabilities
computed as in Equation
(6.1) and resulting in the set
Then, given
the matrix
is constructed
where the values
are drawn using
Poisson sampling with probabilities
defined in
(6.2). In each simulation run, Model (4.2) and the respondents set
are used to
compute the variable
for all
as described in
Section 4.4. Model (4.4) is fitted to obtain
The average item nonresponse rates over simulations for the four items are found to be 26%, 33%, 38% and 23%. The jackknife variance estimator was computed as described in Section 5 using the gencalib() function in the R package ‘sampling’ (Tillé and Matei 2012) and the logistic distance (Deville, Särndal and Sautory 1993).
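The two response phases described above, where units respond according to Poisson sampling and item responses are then drawn for the responding units, can be sketched as independent Bernoulli draws. This is only an illustration: the names `simulate_nonresponse`, `p` and `q` are hypothetical, and the probabilities are toy values rather than those produced by the response models (6.1) and (6.2).

```python
import random

def simulate_nonresponse(sample, p, q, rng):
    """Poisson sampling of respondents and of item responses.
    sample: unit ids; p[k]: unit response probability (assumed given);
    q[k]: list of item response probabilities for unit k (assumed given)."""
    # Phase 1: each sampled unit responds independently with probability p[k]
    respondents = [k for k in sample if rng.random() < p[k]]
    # Phase 2: each item of a responding unit is observed with probability q[k][j]
    item_response = {
        k: [1 if rng.random() < q_kj else 0 for q_kj in q[k]]
        for k in respondents
    }
    return respondents, item_response

# Toy probabilities chosen so that the outcome is deterministic
p = {0: 1.0, 1: 0.0, 2: 1.0}
q = {0: [1.0, 0.0], 1: [1.0, 1.0], 2: [0.0, 1.0]}
respondents, item_response = simulate_nonresponse([0, 1, 2], p, q, random.Random(1))
```

Item indicators are generated only for units in the responding set, which mirrors the conditional construction of the item response matrix given the respondents.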
Table 6.1 reports the results for the two sample sizes. As expected, the Horvitz-Thompson estimator and the three-phase estimator with true response probabilities have almost zero bias, with the latter showing a relatively larger MSE that is due solely to the smaller sample size. The naive estimator shows a very large negative
bias. This is due to the fact that units with a zero value of
are less likely
to respond and the total is clearly underestimated. The proposed three-phase estimator shows a much smaller bias than the naive estimator. Note that the performance of the proposed estimator is mostly driven by absolute bias, so that the performance is not particularly different when increasing the sample size, apart from a decrease in variance. If we compare the proposed estimator with its counterpart based on the true response probabilities, we note that the former still suffers from some bias that comes from response model misspecification (we are not accounting for the values of the variables of interest).
For
the proposed estimator, the jackknife variance estimator was also tested by
looking at the empirical coverage of a 95% confidence interval computed for
each replicate as
For
the mean value
of
over simulations
was 54.8, while for
53.3, with a 95%
coverage rate of 94.6% and 96.3%, respectively. The replicate estimator overestimates the Monte Carlo standard deviation reported for the proposed estimator in Table 6.1 in both cases, but shows good coverage rates.
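The empirical coverage check can be illustrated as follows. The sketch assumes a normal-approximation interval, the estimate plus or minus 1.96 times the estimated standard error, which is a common choice but is an assumption here, since the exact interval expression is not shown above. The name `coverage_rate` and the numbers are hypothetical.

```python
def coverage_rate(estimates, std_errors, total, z=1.96):
    """Share of simulation runs whose interval est +/- z*se contains `total`.
    The normal-approximation interval is an assumption of this sketch."""
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= total <= est + z * se
    )
    return covered / len(estimates)

# Hypothetical estimates and standard errors from four runs, true total 100
rate = coverage_rate([100, 110, 90, 150], [10, 10, 10, 10], total=100)
```

A variance estimator that overestimates the true variability widens the intervals, which tends to push the coverage rate up rather than down.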
Table 6.1
Simulation results for setting 1 - Abortion data set
| Estimator | B | SD | MSE | % RB |
|---|---|---|---|---|
| Smaller sample size | | | | |
| Horvitz-Thompson (full response) | 0.05 | 24.5 | 600.5 | 0.1 |
| Naive | -126.5 | 19.4 | 16,378.6 | -56.2 |
| Three-phase (estimated response probabilities) | 20.6 | 32.4 | 1,474.1 | 9.1 |
| Three-phase (true response probabilities) | 0.02 | 35.0 | 1,225.0 | 0.1 |
| Larger sample size | | | | |
| Horvitz-Thompson (full response) | -0.06 | 16.0 | 255.5 | 0.1 |
| Naive | -126.9 | 13.5 | 16,284.1 | -56.4 |
| Three-phase (estimated response probabilities) | 17.9 | 21.9 | 802.2 | 8.0 |
| Three-phase (true response probabilities) | -0.1 | 23.7 | 559.9 | 0.1 |
To
study the performance of the latent model on the population level and the
correlation between the variable of interest and the estimated latent variable,
we apply the procedure described earlier using
defined in (6.2)
to construct the matrix
for all
population units. We fit Model (4.2) on the population level and compute the
variable
for all
Cronbach’s alpha takes the value 0.83, showing good internal consistency of the items. The correlation coefficient between the variable of interest and the estimated latent variable takes the value 0.76, indicating that the latent auxiliary information has strong power for predicting
as advocated in
the model of Cassel et al. (1983). Inspection of the two-way margins for
the matrix
gives the
residuals
between 0.03 and
0.23. Similarly, the three-way margins for the matrix
give residuals
between 0 and 1.19. This indicates that we have no reason to reject here the
one-factor latent Model (4.2) (see Bartholomew et al. 2002, page 186).
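For readers who want to reproduce the internal consistency check, Cronbach's alpha can be computed from a units-by-items matrix with the standard formula: k/(k-1) times one minus the ratio of the sum of item variances to the variance of the row totals. The sketch below uses a toy binary matrix; the function name and data are illustrative assumptions and do not reproduce the 0.83 reported above.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of rows (units), each a list of k item scores.
    Assumes at least two items and a non-constant row total."""
    k = len(items[0])
    item_vars = [variance([row[j] for row in items]) for j in range(k)]
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical 4 units x 3 binary items
alpha = cronbach_alpha([[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]])
```

Values closer to 1 indicate that the items vary together, which is what a single underlying latent trait would produce.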
6.2 Simulation setting 2
We
generate
for
using a
multivariate normal distribution with mean 1. The degree of correlation between
and
is 0.8, with
We set the
variable of interest to be
and consider
different degrees of correlation between its values and those taken by
namely 0.3, 0.5,
0.8. The values of
are afterwards
standardized to have mean 0 and variance 1.
The
response probabilities are obtained by first computing
and then rescaling them to take values between 0.1 and 0.9, with a population mean approximately equal to 0.7.
The
item response probabilities are generated by first computing
where
and
and then
rescaling the values to be between 0.1 and 0.95.
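The rescaling step used for both sets of probabilities is not spelled out here. One simple possibility, shown purely as an assumption, is a linear min-max map onto the target interval; the helper `rescale` and its inputs are hypothetical.

```python
def rescale(values, low, high):
    """Linearly map `values` onto [low, high] (min-max rescaling).
    Assumes the values are not all equal. This is one possible rescaling,
    not necessarily the one used in the study."""
    vmin, vmax = min(values), max(values)
    return [low + (high - low) * (v - vmin) / (vmax - vmin) for v in values]

# Hypothetical raw scores mapped to response probabilities in [0.1, 0.9]
probs = rescale([-2.0, 0.0, 2.0], 0.1, 0.9)
```

A linear map of this kind fixes the range but not the mean, so hitting a target population mean such as 0.7 would depend on the distribution of the raw scores or on an additional adjustment.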
We
draw
samples by
simple random sampling without replacement of size
For each sample
a response set
is created by
carrying out Poisson sampling with parameter
defined in
(6.3). Each element of the matrix
is generated
using Poisson sampling with parameter
defined in
(6.4). Item nonresponse rates over simulations are approximately 18%, 28%, 35%, 19%, 29% and 34% for the six items, respectively.
For each simulation run, Model (4.2) is used to compute the variable
for all
Model (4.4) is
then fitted to obtain
Table 6.2
Simulation results for setting 2 - Simulated continuous data
| Estimator | B | SD | MSE | % RB |
|---|---|---|---|---|
| Correlation coefficient 0.3 | | | | |
| Horvitz-Thompson (full response) | -0.7 | 131.6 | 17,331.2 | -0.0 |
| Naive | 825.6 | 177.1 | 713,039.3 | 41.0 |
| Three-phase (estimated response probabilities) | -227.4 | 188.0 | 87,033.0 | -11.3 |
| Three-phase (true response probabilities) | 48.4 | 231.8 | 56,073.2 | 2.4 |
| Correlation coefficient 0.5 | | | | |
| Horvitz-Thompson (full response) | 0.1 | 135.0 | 18,220.5 | 0.0 |
| Naive | 972.6 | 176.2 | 977,009.5 | 50.7 |
| Three-phase (estimated response probabilities) | -180.0 | 175.5 | 63,552.0 | -9.4 |
| Three-phase (true response probabilities) | 74.8 | 212.7 | 50,844.0 | 3.9 |
| Correlation coefficient 0.8 | | | | |
| Horvitz-Thompson (full response) | -0.1 | 134.1 | 17,992.0 | -0.0 |
| Naive | 1,154.6 | 168.1 | 1,361,388.1 | 57.7 |
| Three-phase (estimated response probabilities) | -184.8 | 164.4 | 61,173.0 | -9.2 |
| Three-phase (true response probabilities) | 100.6 | 196.2 | 48,597.9 | 5.0 |
Table 6.2 reports on the performance of the estimators for the three values of the nominal correlation coefficient. The proposed estimator always reduces the bias relative to the naive estimator, even when the correlation between the variable of interest and the latent variable gets smaller. The relative bias takes acceptable values in most cases. Bias deserves a closer look. The naive estimator largely overestimates the total in all cases. This is expected, because the values
and
all go in the
same direction. Therefore, in our respondent sample, we are more likely to find relatively larger values of the variable of interest, which leads to overestimation by the naive estimator. On the other hand, the proposed three-phase estimator
underestimates
the total because it is based only on the observed units of
that do have
relatively large values for
but also
relatively large values for
and
and, therefore,
end up having a small weight.
The
matrix of population values
is constructed
in the same way as in Section 6.1 to validate the assumptions behind the 2PL model. Cronbach’s alpha takes a value of approximately 0.5 when the correlation coefficient equals 0.3, 0.6 when it equals 0.5, and 0.7 when it equals 0.8; the pairwise association between the six items reveals values smaller than 0.01. Inspection of the two-way and three-way margins of the matrix
gives residuals
that all take
values smaller than 4. Therefore, the one-factor latent model can be accepted, and the items all seem to be measuring the same latent trait.
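For reference, the 2PL model mentioned above is commonly written as a logistic function of the latent trait with an item-specific discrimination and difficulty. The sketch below uses that common textbook parameterization as an assumption; the exact form of Model (4.2) is not restated in this section, and the function name and parameter values are hypothetical.

```python
import math

def two_pl_probability(theta, discrimination, difficulty):
    """Probability of a positive item outcome under a two-parameter
    logistic (2PL) model, in the common a*(theta - b) parameterization."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Hypothetical item: discrimination 1.5, difficulty 0.0
p_low = two_pl_probability(-1.0, 1.5, 0.0)
p_mid = two_pl_probability(0.0, 1.5, 0.0)
p_high = two_pl_probability(1.0, 1.5, 0.0)
```

When all items load on one latent trait, units with a higher trait value have a higher probability on every item, which is the pattern the margin checks are meant to support.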