6. Simulation studies

Alina Matei and M. Giovanna Ranalli

We evaluate the performance of the estimator presented in Section 5 by means of a Monte Carlo simulation under two different settings. The first one uses a real data set as the population and considers variables of interest that are all binary, while the second one uses simulated population data with variables of interest that are continuous. Results from the first setting are presented in Section 6.1, while those from the second setting are presented in Section 6.2.

In both settings, simple random sampling without replacement is employed and the following estimators are considered:

$HT= \sum_{k \in s} y_{k j} / π_{k} :$ the Horvitz-Thompson estimator in the case of full response is computed as a benchmark in the absence of nonresponse.
${\hat{Y}}_{j, naive} :$ the naive estimator given in (5.1); no explicit action is taken to adjust for unit and item nonresponse. Note that for simple random sampling without replacement, it reduces to ${\hat{Y}}_{j, naive} = N \sum_{k \in r_{j}} y_{k j} / n_{r_{j}},$ where $n_{r_{j}}$ is the size of the set $r_{j},$ and it is the same as the Horvitz-Thompson estimator adjusted for unit nonresponse that assumes uniform response probabilities estimated by $n_{r_{j}} / n .$
${\hat{Y}}_{j, p q} :$ the three-phase estimator proposed in Section 5, Equation (5.2).
${\hat{Y}}_{j, p q, true} :$ the three-phase estimator that uses the true values for the response probabilities $p_{k}$ and $q_{k j}$ is also computed for comparison with ${\hat{Y}}_{j, p q}$ to understand the effect of estimating the response probabilities.
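
To make the two baseline estimators concrete, the following is a minimal Python sketch (the simulations in the paper were run in R) for simple random sampling without replacement, where $π_{k} = n / N.$ The function names and the toy population are hypothetical illustrations and are not taken from the paper.

```python
import random

def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson total: sum of y_k / pi_k over the sampled units."""
    return sum(y / pi for y, pi in zip(y_sample, pi_sample))

def naive_total_srswor(y_item_respondents, N):
    """Naive total under SRSWOR: N times the mean of y over the item respondents r_j."""
    n_rj = len(y_item_respondents)
    if n_rj == 0:
        raise ValueError("the set r_j is empty")
    return N * sum(y_item_respondents) / n_rj

# Toy binary population (hypothetical values, not the Abortion data)
rng = random.Random(1)
N, n = 20, 5
y = [rng.randint(0, 1) for _ in range(N)]
s = rng.sample(range(N), n)                      # SRSWOR sample of labels
ht = ht_total([y[k] for k in s], [n / N] * n)    # full-response benchmark
```

With full response, $r_{j} = s$ and the two functions return the same value, which is the sense in which the naive estimator assumes uniform response probabilities.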

The simulations are carried out in R version 2.15, using the R package ‘ltm’ (Rizopoulos 2006) to fit the latent trait models. The following performance measures are computed for each estimator, generically denoted below by $\hat{Y}$ where suffix $j$ is dropped for ease of notation $(Y$ denotes the population total):

the Monte Carlo Bias

$B = E_{sim} (\hat{Y}) - Y,$

where $E_{sim} (\hat{Y}) = \sum_{i = 1}^{M} {\hat{Y}}_{i} / M,$ ${\hat{Y}}_{i}$ is the value of the estimator $\hat{Y}$ at the $i^{th}$ simulation run and $M$ is the total number of simulation runs;

the Relative Bias

$RB = \frac{B}{Y};$

the Monte Carlo Standard Deviation

$\sqrt{VAR} = \sqrt{\frac{1}{M - 1} \sum_{i = 1}^{M} {({\hat{Y}}_{i} - E_{sim} (\hat{Y}))}^{2}};$

the Monte Carlo Mean Squared Error

$MSE = B^{2} + VAR.$
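
The four measures above can be computed from the vector of simulated estimates as in the following Python sketch. The function name `mc_performance` and the four toy estimates are hypothetical; only the formulas come from the text.

```python
import math

def mc_performance(estimates, Y):
    """Monte Carlo bias, relative bias, standard deviation and MSE of an estimator.

    estimates: values of the estimator over the M simulation runs
    Y: true population total
    """
    M = len(estimates)
    e_sim = sum(estimates) / M                                   # E_sim(Y_hat)
    bias = e_sim - Y                                             # B
    var = sum((est - e_sim) ** 2 for est in estimates) / (M - 1) # VAR
    return {
        "B": bias,
        "RB": bias / Y,
        "SD": math.sqrt(var),
        "MSE": bias ** 2 + var,
    }

# Toy input (hypothetical): four simulated estimates of a total Y = 10
perf = mc_performance([9.0, 11.0, 10.0, 14.0], Y=10.0)
```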

6.1 Simulation setting 1

We consider the Abortion data set formed by four binary variables extracted from the 1986 British Social Attitudes Survey and concerning the attitude towards abortion. The data is available in the R package ‘ltm’ (Rizopoulos 2006). $N = 379$ individuals answered the following questions after being asked if the law should allow abortion under the circumstances presented under each item:

The woman decides on her own that she does not wish to keep the baby.
The couple agrees that they do not wish to have a child.
The woman is not married and does not wish to marry the man.
The couple cannot afford any more children.

The variable of interest $y_{j}$ is selected to be the second one $(j = 2)$ with a total $Y_{j} =225$ in the population.

The data is analyzed by Bartholomew et al. (2002) as an example in which a latent variable can be found that measures the attitude towards abortion. At the population level, we compute the latent variable (denoted here by $θ_{k}^{a}$) using Model (4.2) on the $\{y_{k ℓ}\}_{k = 1, \dots, N; ℓ = 1, \dots, 4}$ data. The correlation between the values $y_{k ℓ}$ and $θ_{k}^{a}$ is approximately equal to 0.85, for $ℓ = 1, \dots, 4.$ Afterwards, we set $θ_{k} = {\hat{θ}}_{k}^{a},$ for all $k = 1, \dots, N.$

At the population level, the unit response probabilities are generated using the following response model

$p_{k} = 1 / (1 + \exp (- (0.7 + y_{k 2} + θ_{k} + 0.2 ε_{k}))), \qquad (6.1)$

with $ε_{k} \sim U (0,1),$ to simulate nonignorable nonresponse. The population mean of $p_{k}$ is approximately 0.74.
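
As an illustration of response model (6.1), the following Python sketch evaluates $p_{k}$ for a few units. The function name and the toy inputs are hypothetical; in particular, the $θ_{k}$ values are drawn here from a standard normal distribution purely for illustration, whereas in the paper they come from the fitted latent trait model.

```python
import math
import random

def unit_response_prob(y_k2, theta_k, eps_k):
    """Unit response probability p_k of model (6.1)."""
    return 1.0 / (1.0 + math.exp(-(0.7 + y_k2 + theta_k + 0.2 * eps_k)))

# Toy inputs (hypothetical, not the Abortion data)
rng = random.Random(0)
y2 = [rng.randint(0, 1) for _ in range(5)]
theta = [rng.gauss(0.0, 1.0) for _ in range(5)]        # illustration only
p = [unit_response_prob(y, t, rng.uniform(0.0, 1.0))   # eps_k ~ U(0, 1)
     for y, t in zip(y2, theta)]
```

Because $y_{k 2}$ enters the linear predictor directly, the response probability depends on the variable of interest itself, which is what makes the nonresponse nonignorable.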

To generate item response probabilities at the population level, the following model is used

$q_{k ℓ} = 1 / (1 + \exp (- (b_{ℓ} θ_{k} + a_{ℓ} + y_{k ℓ}))), \text{ for } ℓ = 1, \dots, 4, \qquad (6.2)$

where $b_{ℓ} = 3,$ for $ℓ = 1, \dots,4,$ while $a_{ℓ}$ takes different values according to $ℓ;$ in particular, $a_{1} = 1, a_{2} = 0, a_{3} = - 0.5$ and $a_{4} = 1.$ The nominal item nonresponse rate for the four items in the population is 35%, 42%, 47%, 31%, respectively.
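
Model (6.2) with the parameter values just given can be sketched in Python as follows. The constant and function names are hypothetical; the values of $a_{ℓ}$ and $b_{ℓ}$ are those stated in the text.

```python
import math

A_INTERCEPTS = [1.0, 0.0, -0.5, 1.0]   # a_1, ..., a_4 as given in the text
B_SLOPES = [3.0, 3.0, 3.0, 3.0]        # b_l = 3 for every item

def item_response_probs(theta_k, y_k):
    """Item response probabilities q_{k1}, ..., q_{k4} of model (6.2) for one unit.

    theta_k: latent variable of unit k
    y_k: the four binary answers (y_{k1}, ..., y_{k4}) of unit k
    """
    return [1.0 / (1.0 + math.exp(-(b * theta_k + a + y)))
            for a, b, y in zip(A_INTERCEPTS, B_SLOPES, y_k)]

# A unit with theta_k = 0 that answers 0 to all four items
q_example = item_response_probs(0.0, [0, 0, 0, 0])
```

For this unit the second item has probability exactly 0.5, since $a_{2} = 0;$ the third item, with its negative intercept, is the least likely to be answered.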

We draw $M = 10,000$ simple random samples without replacement from the population using two sample sizes: $n = 50$ and $n = 100.$ In each sample $s,$ the units are classified as respondents according to Poisson sampling, using the probabilities $p_{k}$ computed as in Equation (6.1) and resulting in the set $r.$ Then, given $r,$ the matrix $\{x_{k ℓ}\}_{k \in r; ℓ = 1, \dots, 4}$ is constructed where the values $x_{k ℓ}$ are drawn using Poisson sampling with probabilities $q_{k ℓ}$ defined in (6.2). In each simulation run, Model (4.2) and the respondent set $r$ are used to compute the variable ${\hat{θ}}_{k}$ for all $k \in s$ as described in Section 4.4. Model (4.4) is fitted to obtain ${\hat{p}}_{k}.$ The average item nonresponse rate over simulations for the four items is found to be 26%, 33%, 38% and 23%. The jackknife variance estimator is computed as described in Section 5 using the gencalib() function in R package ‘sampling’ (Tillé and Matei 2012) and the logistic distance (Deville, Särndal and Sautory 1993).
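
The data-generating part of one simulation run (sample selection, unit response, item response) can be sketched in Python as below. This covers only the sampling steps, not the fitting of Models (4.2) and (4.4) or the jackknife. The function names are hypothetical, and the constant probabilities in the toy call are illustrative stand-ins for the unit-specific $p_{k}$ and $q_{k ℓ}$ of (6.1) and (6.2).

```python
import random

def poisson_sample(units, prob, rng):
    """Poisson sampling: keep each unit independently with its own probability."""
    return [k for k in units if rng.random() < prob[k]]

def one_run(N, n, p, q, rng):
    """Draw s by SRSWOR, then the unit respondents r, then the indicators x_{kl}."""
    s = rng.sample(range(N), n)          # simple random sample without replacement
    r = poisson_sample(s, p, rng)        # unit respondents, probabilities p_k
    # x_{kl} = 1 if unit k answers item l, drawn with probability q_{kl}
    x = {k: [1 if rng.random() < q_kl else 0 for q_kl in q[k]] for k in r}
    return s, r, x

# Toy call with constant probabilities (illustrative assumption only)
rng = random.Random(2015)
N, n = 379, 50
p = [0.74] * N
q = [[0.65, 0.58, 0.53, 0.69] for _ in range(N)]
s, r, x = one_run(N, n, p, q, rng)
```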

Table 6.1 reports the results for $n = 50$ and $n = 100.$ As expected, $HT$ and ${\hat{Y}}_{j, p q, true}$ have almost zero bias, with the second one showing a relatively larger MSE that is due solely to the smaller sample size. The naive estimator shows a very large negative bias. This is due to the fact that units with a zero value of $y_{j}$ are less likely to respond and the total is clearly underestimated. The estimator ${\hat{Y}}_{j, p q}$ shows a much smaller bias than the naive estimator. Note that the performance of the proposed estimator is mostly driven by absolute bias, so that the performance is not particularly different when increasing the sample size, apart from a decrease in variance. If we compare ${\hat{Y}}_{j, p q, true}$ and ${\hat{Y}}_{j, p q},$ we note that ${\hat{Y}}_{j, p q}$ still suffers from some bias that comes from response model misspecification (the fitted response model does not account for the values of the variables of interest).

For the proposed estimator, the jackknife variance estimator was also tested by looking at the empirical coverage of a 95% confidence interval computed for each replicate as ${\hat{Y}}_{j, p q} \pm 1.96 \sqrt{{\hat{V}}_{r}}.$ For $n = 50,$ the mean value of $\sqrt{{\hat{V}}_{r}}$ over simulations was 54.8, while for $n = 100$ it was 53.3, with empirical coverage rates of 94.6% and 96.3%, respectively. The replicate estimator overestimates the Monte Carlo standard deviation reported for ${\hat{Y}}_{j, p q}$ in Table 6.1 in both cases, but shows good coverage rates.
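
The empirical coverage described above is the share of simulation runs whose interval contains the true total. A minimal Python sketch follows; the function name and the two toy runs are hypothetical.

```python
import math

def coverage_rate(estimates, variance_estimates, Y, z=1.96):
    """Share of runs in which estimate +/- z * sqrt(variance estimate) covers Y."""
    hits = 0
    for est, v_hat in zip(estimates, variance_estimates):
        half_width = z * math.sqrt(v_hat)
        if est - half_width <= Y <= est + half_width:
            hits += 1
    return hits / len(estimates)

# Toy input (hypothetical): two runs with true total Y = 10.5
rate = coverage_rate([10.0, 20.0], [1.0, 1.0], Y=10.5)
```

In this toy input only the first interval, [8.04, 11.96], contains 10.5, so the rate is 0.5.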

Table 6.1
Simulation results for setting 1 - Abortion data set
Estimator	B	$\sqrt{VAR}$	MSE	% RB
$n = 50$
$HT$	0.05	24.5	600.5	$<$ 0.1
${\hat{Y}}_{j, naive}$	-126.5	19.4	16,378.6	-56.2
${\hat{Y}}_{j, p q}$	20.6	32.4	1,474.1	9.1
${\hat{Y}}_{j, p q, true}$	0.02	35.0	1,225.0	$<$ 0.1
$n = 100$
$HT$	-0.06	16.0	255.5	$<$ 0.1
${\hat{Y}}_{j, naive}$	-126.9	13.5	16,284.1	-56.4
${\hat{Y}}_{j, p q}$	17.9	21.9	802.2	8.0
${\hat{Y}}_{j, p q, true}$	-0.1	23.7	559.9	$<$ 0.1
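
As a consistency check of the definitions of $MSE$ and $RB$ against Table 6.1, take the naive estimator for $n = 50,$ with $B = -126.5,$ $\sqrt{VAR} = 19.4$ and $Y = 225:$

$B^{2} + VAR = {(-126.5)}^{2} + {19.4}^{2} = 16,002.25 + 376.36 = 16,378.61,$

$RB = B / Y = -126.5 / 225 \approx -0.562.$

Both values agree with the tabulated MSE of 16,378.6 and % RB of -56.2. Small discrepancies in other rows (for example 255.5 reported against ${16.0}^{2} = 256$ for $HT$ with $n = 100)$ are presumably due to rounding of the reported $B$ and $\sqrt{VAR}.$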

To study the performance of the latent model at the population level and the correlation between the variable of interest and the estimated latent variable, we apply the procedure described earlier using $q_{k ℓ}$ defined in (6.2) to construct the matrix $\{x_{k ℓ}\}_{k = 1, \dots, N; ℓ = 1, \dots, 4}$ for all population units. We fit Model (4.2) at the population level and compute the variable $θ_{k}$ for all $k = 1, \dots, N.$ Cronbach’s alpha takes the value 0.83, showing good internal consistency of the items. The correlation coefficient between the variable of interest and the estimated latent variable takes the value 0.76, indicating that the latent auxiliary information has strong power for predicting $y_{k 2},$ as advocated in the model of Cassel et al. (1983). Inspection of the two-way margins for the matrix $\{x_{k ℓ}\}$ gives residuals ${(O - E)}^{2} / E$ between 0.03 and 0.23. Similarly, the three-way margins for the matrix $\{x_{k ℓ}\}$ give residuals between 0 and 1.19. This indicates that we have no reason to reject here the one-factor latent Model (4.2) (see Bartholomew et al. 2002, page 186).
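
For reference, Cronbach's alpha for a units-by-items matrix can be computed with the standard formula $α = \frac{K}{K - 1} (1 - \sum_{ℓ} s_{ℓ}^{2} / s_{T}^{2}),$ where $s_{ℓ}^{2}$ is the variance of item $ℓ$ and $s_{T}^{2}$ is the variance of the row totals. The Python sketch below is a generic illustration with hypothetical names and toy data; the text does not state which routine was used to obtain the reported value.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, one row of K item scores per unit."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Toy matrix (hypothetical): two items that always agree
alpha = cronbach_alpha([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
```

Two items that always agree give the maximum value of 1; weaker agreement between items lowers alpha.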

6.2 Simulation setting 2

We generate $\{y_{k 1}, \dots, y_{k 6}, θ_{k}\}$ for $k = 1, \dots, N = 2,000$ using a multivariate normal distribution with mean 1. The degree of correlation between $y_{ℓ}$ and $y_{ℓ^{'}}$ is 0.8, with $ℓ, ℓ^{'} = 1, \dots, 6, ℓ \neq ℓ^{'}.$ We set the variable of interest to be $y_{6}$ and consider different degrees of correlation between its values and those taken by $θ_{k},$ namely 0.3, 0.5, 0.8. The values of $θ_{k}$ are afterwards standardized to have mean 0 and variance 1.

The response probabilities are obtained by first computing

$p_{k}^{\circ} = 1 / [1 + \exp (- (0.5 + y_{k 1} + θ_{k}))], \text{ for } k = 1, \dots, N, \qquad (6.3)$

and then rescaling them to take values between 0.1 and 0.9, with a population mean approximately equal to 0.7.
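
The text does not specify how the rescaling is done. One simple possibility, shown here purely as an assumption, is a linear (min-max) map of the $p_{k}^{\circ}$ values onto the target interval. The function name `rescale` and the toy values are hypothetical, and a linear map alone does not guarantee the stated population mean.

```python
def rescale(values, lo, hi):
    """Linearly map values so that the minimum becomes lo and the maximum hi.

    Assumed rescaling rule (min-max); the paper does not state the rule it used.
    """
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("all values are equal; the linear map is undefined")
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

# Toy input (hypothetical): three raw probabilities mapped into [0.1, 0.9]
p_rescaled = rescale([0.2, 0.5, 0.8], 0.1, 0.9)
```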

The item response probabilities are generated by first computing

$q_{k ℓ}^{\circ} = 1 / (1 + \exp (- (b_{ℓ} θ_{k} + a_{ℓ} + y_{k ℓ}))), \text{ for } k = 1, \dots, N \text{ and } ℓ = 1, \dots, 6, \qquad (6.4)$

where $\{a_{ℓ}\}_{ℓ = 1, \dots, 6} = \{1, 0, -0.5, 1, 0, -0.5\}$ and $\{b_{ℓ}\}_{ℓ = 1, \dots, 6} = \{1, 1, 1, 1.5, 1.5, 1.5\},$ and then rescaling the values to be between 0.1 and 0.95.

We draw $M = 10,000$ samples of size $n = 200$ by simple random sampling without replacement. For each sample $s,$ a response set $r$ is created by carrying out Poisson sampling with parameter $p_{k}$ defined in (6.3). Each element of the matrix $\{x_{k ℓ}\}_{k \in r; ℓ = 1, \dots, 6}$ is generated using Poisson sampling with parameter $q_{k ℓ}$ defined in (6.4). Item nonresponse rates over simulations are approximately 18%, 28%, 35%, 19%, 29% and 34%, for $ℓ = 1, \dots, 6,$ respectively. For each simulation run, Model (4.2) is used to compute the variable ${\hat{θ}}_{k}$ for all $k \in s.$ Model (4.4) is then fitted to obtain ${\hat{p}}_{k}.$

Table 6.2
Simulation results for setting 2 - Simulated continuous data
Estimator	B	$\sqrt{VAR}$	MSE	% RB
correlation coefficient 0.3
$HT$	-0.7	131.6	17,331.2	$\approx$ -0.0
${\hat{Y}}_{j, naive}$	825.6	177.1	713,039.3	41.0
${\hat{Y}}_{j, p q}$	-227.4	188.0	87,033.0	-11.3
${\hat{Y}}_{j, p q, true}$	48.4	231.8	56,073.2	2.4
correlation coefficient 0.5
$HT$	0.1	135.0	18,220.5	$\approx$ 0.0
${\hat{Y}}_{j, naive}$	972.6	176.2	977,009.5	50.7
${\hat{Y}}_{j, p q}$	-180.0	175.5	63,552.0	-9.4
${\hat{Y}}_{j, p q, true}$	74.8	212.7	50,844.0	3.9
correlation coefficient 0.8
$HT$	-0.1	134.1	17,992.0	$\approx$ -0.0
${\hat{Y}}_{j, naive}$	1,154.6	168.1	1,361,388.1	57.7
${\hat{Y}}_{j, p q}$	-184.8	164.4	61,173.0	-9.2
${\hat{Y}}_{j, p q, true}$	100.6	196.2	48,597.9	5.0

Table 6.2 reports on the performance of the estimators for the three values taken by the nominal correlation coefficient between $y_{k 6}$ and $θ_{k}$: 0.3, 0.5 and 0.8. The proposed estimator always reduces the bias relative to the naive estimator, even when the correlation between the variable of interest and the latent variable gets smaller. The relative bias takes acceptable values in most cases. Bias deserves a closer look. In all cases, the naive estimator largely overestimates the total. This is expected, because the values $p_{k}, q_{k 6}, θ_{k}$ and $y_{k 6}$ all go in the same direction. Therefore, in our sample of respondents, we are more likely to find relatively large values of $y_{6},$ which leads the naive estimator to overestimate the total. On the other hand, ${\hat{Y}}_{j, p q}$ underestimates the total because it is based only on the observed units of $r_{j},$ which have relatively large values of $y_{6}$ but also relatively large values of $p_{k}$ and $q_{k 6}$ and therefore end up with small weights.

The matrix of population values $\{x_{k ℓ}\}_{k = 1, \dots, 2,000; ℓ = 1, \dots, 6}$ is constructed in the same way as in Section 6.1 to validate the assumptions behind the 2PL model. Cronbach’s alpha is approximately 0.5 when the correlation coefficient equals 0.3, 0.6 when it equals 0.5, and 0.7 when it equals 0.8; tests of pairwise association between the six items give $p$-values smaller than 0.01. Inspection of the two-way and three-way margins of the matrix $\{x_{k ℓ}\}$ gives residuals ${(O - E)}^{2} / E$ that are all smaller than 4. Therefore, the one-factor latent model can be accepted, and the items all seem to measure the same latent trait.

Date modified: 2015-11-27

Survey Methodology
