Survey Methodology

5 Numerical studies

Chen Xu, Jiahua Chen and Harold Mantel

To evaluate the finite sample performance of PPL-BIC, we conducted extensive numerical studies using data from the Survey on Living with Chronic Diseases in Canada (SLCDC; Statistics Canada 2009). In particular, we compare the proposed procedure with classic non-survey methods based on regression models postulated between SLCDC variables and hypothetical (simulated) responses. Two simulation scenarios offer tentative insights into pseudo-likelihood-based selection. In the first scenario, populations are generated from presumed models and samples are obtained by designs that potentially create spurious correlations among SLCDC variables. In the second scenario, populations are not accurately generated from presumed models and samples are obtained by a design related to both response and candidate covariates. We also report an analysis of the original SLCDC 2009 data as an example of using PPL-BIC in real applications.

5.1 SLCDC data

SLCDC is a cross-sectional study sponsored by the Public Health Agency of Canada that collects information related to the experiences of Canadians with chronic health conditions. One of the main objectives of SLCDC is to identify health behavior that influences disease outcomes, so that the government can better plan and provide health services for people with chronic diseases.

SLCDC takes place every two years, with two chronic diseases covered in each survey cycle. The 2009 survey focused on arthritis and hypertension. We restrict our attention to hypertension. The target population for the hypertension survey is Canadians aged twenty years or older from the ten provinces who have been diagnosed with hypertension and who live in private dwellings. To facilitate the survey process, the sampling units of SLCDC 2009 are people with hypertension who completed the 2008 Canadian Community Health Survey (CCHS). For the purpose of SLCDC, the CCHS respondents are first stratified by sex and four age groups: 20-44, 45-64, 65-75, and 75+. The finite population formed by the CCHS respondents was thus divided into 8 categories, age (4 levels) by sex (2 levels). A stratified sampling plan is used for SLCDC with proportional sample size allocation. An overall sample of 9,005 was selected from the 17,437 CCHS respondents, and 6,142 respondents completed the SLCDC survey.

We identified 40 variables relevant to hypertension based on the original SLCDC data, among which 7 variables have complete information on all 6,142 respondents. The remaining 33 variables have some missing values due to item non-response in the original questionnaire (see Table A.1 in the Appendix for the list of variables and corresponding non-response rates). There was no obvious systematic reason for the item non-response. The variable with the most severe missingness is INCDRPR (household income), with a 9.6% non-response rate; the amount of missing data is relatively minor for the remaining variables. To facilitate the analysis, we used simple imputation methods for the missing data as follows. For a categorical variable, we imputed the non-response value by a random value from the response set; for a continuous variable, we imputed the non-response value by the mean value of the responses. The two exceptions to the above imputation are the variables BMHX_02 and CNHX_05. The former acts as the response variable of the regression model in the later data analysis, while the latter has natural restrictions on the range of its values. For these two variables, we instead removed the 274 observations with missing values, which results in the basic working data with 5,868 observations. The imputation/removal procedure does not have any effect on evaluating the BIC procedure based on the simulated population. It could bias the analysis of the real data. Yet given the low rate of missingness and the plausibility of missing at random in this specific case, the conclusion is unlikely to be severely affected.
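The two imputation rules can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names and the toy data are hypothetical, and "a random value from the response set" is read here as a random draw from the observed responses, which is an assumption.

```python
import random
from statistics import mean

def impute_categorical(values, rng):
    """Replace None by a random draw from the observed responses (assumed reading)."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def impute_continuous(values):
    """Replace None by the mean of the observed responses."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [v if v is not None else fill for v in values]

rng = random.Random(0)
smoker = ["yes", "no", None, "no"]   # hypothetical categorical item
bmi = [24.0, None, 30.0, 27.0]       # hypothetical continuous item
smoker_filled = impute_categorical(smoker, rng)
bmi_filled = impute_continuous(bmi)
```

Mean imputation leaves the observed mean unchanged but shrinks the variance of the imputed variable, which is one reason the low non-response rates matter for the real-data analysis.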

Since the SLCDC is a follow-up to the CCHS, the sampling weights for SLCDC were initially obtained from the weights of the CCHS data. The weights were then adjusted to ensure that the SLCDC respondents represent the target population. Consequently, the adjusted weights show considerable variation between sampled units. After scaling by $k = n / N \approx 10^{- 3},$ the adjusted weights range from 0.01 to 33.62 with an inter-quartile range of 0.76.

5.2 Scenario 1: Spurious correlation

As mentioned, in complex survey designs, the correlation structure between variables reflected in the sample can differ from that in the population. In the first simulation scenario, we assess the proposed BIC method when data are collected through designs that potentially create spurious correlations between candidate covariates. Specifically, we treat the 40 identified variables as candidate covariates for some hypothetical response $Y,$ and index them as $X_{1}$ to $X_{40}$ for simplicity. We consider both continuous and binary responses in our simulations. For the continuous cases, we generate the values of $Y$ according to

Model 1 : $Y = 0.7 X_{6} + 0.7 X_{10} + 0.6 X_{18} - 0.6 X_{22} + ε,$

Model 2 : $Y = 0.7 X_{6} + 0.6 X_{10} + 0.6 X_{18} - 0.5 X_{22} + 0.3 X_{30} - 0.3 X_{34} + ε,$

with $ε \sim N (0,1) .$ For the binary cases where $Y \in \{0,1\},$ we generate the values of $Y$ according to the logistic models

Model 3 : $logit (\Pr \{Y = 1 \mid X\}) = 0.7 X_{7} - 0.6 X_{8} + 0.5 X_{26},$

Model 4 : $logit (\Pr \{Y = 1 \mid X\}) = 0.8 X_{7} - 0.7 X_{8} + 0.6 X_{26} - 0.5 X_{28} + 0.4 X_{36} .$

The specified models include one of the strata identifiers in SLCDC (i.e., $X_{6}$ or $X_{7}$ ) with a nested structure for each modeling context.
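For concreteness, generating one response from Model 1 and one from Model 3 might look like the sketch below. The covariate vector is a hypothetical placeholder (the actual SLCDC coding of each $X_{k}$ is not reproduced here), and the helper names are ours.

```python
import math
import random

def linear_response(x, coefs, rng):
    """Y = sum_k coef_k * x_k + eps, with eps ~ N(0, 1)."""
    return sum(b * x[k] for k, b in coefs.items()) + rng.gauss(0.0, 1.0)

def logistic_response(x, coefs, rng):
    """Y ~ Bernoulli(p), with logit(p) = sum_k coef_k * x_k."""
    eta = sum(b * x[k] for k, b in coefs.items())
    p = 1.0 / (1.0 + math.exp(-eta))
    return 1 if rng.random() < p else 0

# Coefficients of Models 1 and 3, keyed by covariate index.
MODEL_1 = {6: 0.7, 10: 0.7, 18: 0.6, 22: -0.6}
MODEL_3 = {7: 0.7, 8: -0.6, 26: 0.5}

rng = random.Random(1)
# Hypothetical covariate values for a single unit (not SLCDC data).
x = {k: 0.0 for k in range(1, 41)}
x.update({6: 1.0, 10: 0.5, 18: 1.0, 7: 1.0, 8: 1.0})
y_cont = linear_response(x, MODEL_1, rng)
y_bin = logistic_response(x, MODEL_3, rng)
```

Models 2 and 4 follow the same pattern with their own coefficient dictionaries.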

The finite population used in the simulation was created as follows. The basic working data of 5,868 respondents was duplicated 10 times proportional to the rounded integer values of SLCDC weights, which results in a pseudo-finite population of size 55,950 with complete information on $X_{1}, \dots, X_{40} .$ The values of response $Y$ were then generated based on Models 1-4 respectively. We consider the variable selection problem to be the identification of the postulated model that generates the values of $Y .$
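One way to read the replication step is that unit $i$ is copied a number of times equal to 10 times its scaled weight, rounded to an integer. That reading is an assumption on our part, as are the function name and the toy weights in this sketch.

```python
def replicate_units(units, scaled_weights, factor=10):
    """Build a pseudo-population by copying each unit round(factor * w) times.

    The rule round(factor * w) is an assumed interpretation of the text.
    """
    population = []
    for unit, w in zip(units, scaled_weights):
        copies = int(round(factor * w))
        population.extend([unit] * copies)
    return population

units = ["a", "b", "c", "d"]            # hypothetical respondents
weights = [0.01, 0.76, 1.2, 3.36]       # hypothetical scaled weights
population = replicate_units(units, weights)
```

Under this reading, units with very small scaled weights (here unit "a") receive zero copies and drop out of the pseudo-population.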

We investigate the performance of the proposed procedure under two stratified sampling plans. Specifically, we create 4 strata based on variables $X_{6}$ (age, 55-/55+) and $X_{7}$ (sex, Male/Female), which leads to the group (Female, 55-) of size 7,120, group (Female, 55+) of size 19,199, group (Male, 55-) of size 6,187, and group (Male, 55+) of size 23,458. In the first plan, a simple random sample without replacement (SRSWOR) with equally allocated sample size is drawn from each stratum. The inference is made based on the four SRSWORs pooled together. In the second plan, we further construct three subgroups within each stratum based on the sum of two binary covariates of the postulated models. In particular, the subgroups are built according to $X_{18} + X_{22}$ for data generated by Models 1-2, while the subgroups are similarly constructed based on $X_{8} + X_{26}$ for data from Models 3-4. We then make inference based on SRSWORs drawn from each subgroup of the four strata. The overall sample size is equally allocated at the stratum level with a 2:1:2 proportion for the three subgroups within the same stratum. A simple Monte Carlo computation reveals that the sample correlation between $X_{18}$ and $X_{22}$ (for data from Models 1-2) can be as high as 0.5, whereas their population-based correlation is merely around 0.02. A similar phenomenon is also observed between $X_{8}$ and $X_{26}$ (for data from Models 3-4). We therefore expect variable selection under the second sampling plan to be more challenging due to this systematic inflation. In the simulations, we set the overall sample size $n =$ 500 for Models 1-2 and $n =$ 1,500 for Models 3-4. A summary of the variables influencing the response and the design variables affecting the sampling probabilities can be found in the Appendix (Table A.2).
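The first sampling plan can be sketched as below, using the stratum sizes quoted above. Unit identifiers are placeholders, the function name is hypothetical, and the design weight $N_{h} / n_{h}$ is the standard inverse inclusion probability for SRSWOR within a stratum rather than the scaled weight used in the paper.

```python
import random

# Stratum sizes quoted in the text; units are indexed 0..size-1 as placeholders.
STRATUM_SIZES = {
    ("Female", "55-"): 7120,
    ("Female", "55+"): 19199,
    ("Male", "55-"): 6187,
    ("Male", "55+"): 23458,
}

def stratified_srswor(stratum_sizes, n_total, rng):
    """Draw an equal-allocation stratified SRSWOR.

    Returns (stratum, unit_index, design_weight) triples, where the design
    weight is the inverse inclusion probability N_h / n_h.
    """
    n_h = n_total // len(stratum_sizes)  # equal allocation across strata
    sample = []
    for stratum, size in stratum_sizes.items():
        drawn = rng.sample(range(size), n_h)  # sampling without replacement
        weight = size / n_h
        sample.extend((stratum, unit, weight) for unit in drawn)
    return sample

rng = random.Random(2009)
sample = stratified_srswor(STRATUM_SIZES, 500, rng)
```

With $n =$ 500, each stratum contributes 125 units. Under the second plan the same pattern would apply within subgroups, where a 2:1:2 split of 125 gives subgroup sample sizes of 50, 25 and 50.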

The PPL-BIC selection procedure was carried out on probability samples obtained from the finite population. In particular, we scaled the survey weights as mentioned in (3.2) and chose the SCAD penalty for the penalized pseudo-likelihood function (3.5). The corresponding maximizer of (3.5) was computed using the thresholding-based iterative algorithm (She 2011). For comparison, AIC- and GCV-type criteria are also used as alternatives to the proposed BIC (3.4). Based on the discussion in Section 3, we define the pseudo-likelihood-based AIC and GCV as

${AIC}_{n} (s) = - 2 l_{n} ({\hat{β}}_{s}) + 2 τ (s),$

${GCV}_{n} (s) = - \frac{1}{n} \frac{l_{n} ({\hat{β}}_{s})}{{(1 - τ (s) / n)}^{2}},$

which are similarly implemented through the PPL-based procedure. Moreover, for each setup, we repeat the selection procedure with all survey weights ignored (set to unity). The unweighted selection results correspond to pure model-based inferences as discussed in Section 2. In particular, the pseudo-likelihood-based BIC reduces to the classic BIC (3.1) used for non-survey situations.
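Given the maximized pseudo log-likelihood $l_{n} ({\hat{β}}_{s})$ and the model size $τ (s)$ of each candidate, the two criteria above are simple to evaluate. The sketch below uses hypothetical candidate values and function names; it does not reproduce the fitting step or the BIC (3.4), which is defined in Section 3.

```python
def pseudo_aic(loglik, size):
    """AIC_n(s) = -2 * l_n + 2 * tau(s)."""
    return -2.0 * loglik + 2.0 * size

def pseudo_gcv(loglik, size, n):
    """GCV_n(s) = -(l_n / n) / (1 - tau(s) / n)^2."""
    return -(loglik / n) / (1.0 - size / n) ** 2

n = 500
# Hypothetical candidates, stored as {model size: pseudo log-likelihood}.
candidates = {3: -720.0, 5: -700.0, 9: -697.0}
best_aic = min(candidates, key=lambda s: pseudo_aic(candidates[s], s))
best_gcv = min(candidates, key=lambda s: pseudo_gcv(candidates[s], s, n))
```

In both cases the selected model is the candidate that minimizes the criterion.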

In Tables 5.1-5.2, we summarize the simulation results based on 1,000 repetitions in terms of the positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR), and averaged model size (AMS). Specifically, let $s_{0}$ be the true model that generates the finite population and $s_{j}^{'}$ be the selected model based on the $j^{th}$ sample, $j = 1, \dots,$ 1,000. The PSR, FDR, CSR and AMS are estimated as

$PSR = \frac{\sum_{j = 1}^{1,000} τ (s_{0} \cap s_{j}^{'})}{1,000 \, τ (s_{0})}, \quad FDR = \frac{\sum_{j = 1}^{1,000} τ (s_{j}^{'} \setminus s_{0})}{1,000 \, τ (s_{j}^{'})},$

$CSR = \frac{\sum_{j = 1}^{1,000} I (s_{j}^{'} = s_{0})}{1,000}, \quad AMS = \frac{\sum_{j = 1}^{1,000} τ (s_{j}^{'})}{1,000},$

where $τ (s)$ denotes the size of model $s$ and $I (\cdot)$ is the indicator function. In addition, we assess the predictive accuracy of the selected model as follows. For each setup, a test sample of size 200 is generated by SRSWOR from the same finite population as that for the training sample. For Models 1-2, we use the averaged residual sum of squares (RSS) on the test data as a measurement of the predictive ability of the selected model. For Models 3-4, we compute both positive and negative prediction rates. To be specific, let $π^{*}$ be a specified benchmark and ${\hat{π}}_{i}$ be the estimated success probability of the $i^{th}$ test sample, $i = 1, \dots,200.$ We then predict the $i^{th}$ response $y_{i}$ by ${\hat{y}}_{i} = 1$ if ${\hat{π}}_{i} > π^{*}$ and ${\hat{y}}_{i} = 0$ otherwise. The correct prediction rates are estimated by

$PPR = \frac{\sum_{i \in \{i : y_{i} = 1\}} I ({\hat{y}}_{i} = 1)}{\sum_{i = 1}^{200} I (y_{i} = 1)}, \quad NPR = \frac{\sum_{i \in \{i : y_{i} = 0\}} I ({\hat{y}}_{i} = 0)}{\sum_{i = 1}^{200} I (y_{i} = 0)} .$

The final PPR and NPR are averaged over 1,000 replications. Note that PPR and NPR here are similar to sensitivity and specificity in clinical studies, which indicate the ability of a 0-1 prediction approach in terms of correct positive and negative predictions. In general, a larger $π^{*}$ leads to a higher NPR but a lower PPR. The value of $π^{*}$ should be cautiously specified in applications. In our simulation studies, we fix $π^{*} =$ 0.5 for simplicity.
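The selection and prediction summaries defined above can be computed as in the following sketch. Models are represented as sets of covariate indices, the function names and toy inputs are hypothetical, and the FDR is computed here as the average over replications of the per-sample ratio $τ (s_{j}^{'} \setminus s_{0}) / τ (s_{j}^{'})$, which is our reading of the displayed formula.

```python
def selection_summary(true_model, selected_models):
    """Return (PSR, FDR, CSR, AMS) over a list of selected models."""
    s0 = set(true_model)
    sets = [set(s) for s in selected_models]
    r = len(sets)
    psr = sum(len(s0 & s) for s in sets) / (r * len(s0))
    # Per-replication false discovery proportion; empty models count as 0.
    fdr = sum(len(s - s0) / len(s) for s in sets if s) / r
    csr = sum(s == s0 for s in sets) / r
    ams = sum(len(s) for s in sets) / r
    return psr, fdr, csr, ams

def prediction_rates(y, prob, benchmark=0.5):
    """Return (PPR, NPR) for 0-1 predictions made at the given benchmark."""
    y_hat = [1 if p > benchmark else 0 for p in prob]
    pos = [yh for yi, yh in zip(y, y_hat) if yi == 1]
    neg = [yh for yi, yh in zip(y, y_hat) if yi == 0]
    ppr = sum(yh == 1 for yh in pos) / len(pos)
    npr = sum(yh == 0 for yh in neg) / len(neg)
    return ppr, npr

# Hypothetical selections over three replications with Model 1 as the truth.
selected = [{6, 10, 18, 22}, {6, 10, 18}, {6, 10, 18, 22, 30, 34}]
psr, fdr, csr, ams = selection_summary({6, 10, 18, 22}, selected)

# Hypothetical test responses and estimated success probabilities.
ppr, npr = prediction_rates([1, 1, 0, 0, 1], [0.8, 0.4, 0.3, 0.6, 0.55])
```

In this toy example the second replication misses one true covariate and the third adds two irrelevant ones, so only the first counts toward the CSR.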

The results are encouraging for the proposed BIC method. From Tables 5.1-5.2, we observe that models selected by AIC have both high PSR and high FDR, which indicates an excessive inclusion of irrelevant variables. In comparison, the BIC substantially reduces the FDR of the selected models with a slight sacrifice in PSR, and selects models whose sizes are closer to the truth. Although the GCV behaves similarly to BIC in the linear model settings, it concurs with AIC for the logistic models, where less information is provided by the binary responses.

In the first sampling plan, the inclusion probabilities are related to $Y$ only through a single covariate in the model (i.e., $X_{6}$ or $X_{7}$ ). The sample correlation structure between the response and covariates is largely maintained from the finite population. Consequently, no substantial difference is observed between the weighted and unweighted selection procedures from Table 5.1.

Table 5.1
Selection for the design not generating strong spurious correlations (1st plan). Results are summarized in terms of positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR) and averaged model size (AMS); Prediction assessments for Models 1-2 are based on the testing residual sum of squares (RSS), while for Models 3-4 they are based on positive/negative prediction rate (PPR, NPR) with a benchmark 0.5.
Weights	Criterion	PSR	FDR	CSR	AMS	Prediction
Model 1
Ignored	GCV	0.96	0.19	0.28	4.9	1.04
	AIC	0.99	0.48	0.05	8.7	1.08
	BIC	0.96	0.19	0.28	4.9	1.04
Included	GCV	0.95	0.24	0.19	5.2	1.05
	AIC	0.99	0.61	0.01	11.4	1.11
	BIC	0.95	0.24	0.20	5.3	1.05
Model 2
Ignored	GCV	0.72	0.19	0.02	5.5	1.07
	AIC	0.89	0.44	0.01	10.3	1.09
	BIC	0.73	0.19	0.03	5.6	1.07
Included	GCV	0.74	0.24	0.02	6.1	1.08
	AIC	0.89	0.54	0.01	12.5	1.12
	BIC	0.74	0.24	0.03	6.1	1.08
Model 3
Ignored	GCV	0.99	0.59	0.00	7.8	(0.71, 0.45)
	AIC	0.99	0.62	0.00	8.4	(0.69, 0.49)
	BIC	0.96	0.43	0.00	5.1	(0.72, 0.44)
Included	GCV	0.99	0.67	0.00	9.9	(0.71, 0.47)
	AIC	0.99	0.70	0.00	10.7	(0.68, 0.48)
	BIC	0.94	0.45	0.00	5.3	(0.71, 0.45)
Model 4
Ignored	GCV	0.97	0.44	0.01	9.4	(0.66, 0.55)
	AIC	0.98	0.47	0.01	9.8	(0.65, 0.56)
	BIC	0.87	0.26	0.07	6.0	(0.69, 0.53)
Included	GCV	0.98	0.54	0.01	11.4	(0.66, 0.54)
	AIC	0.98	0.56	0.00	11.9	(0.66, 0.55)
	BIC	0.86	0.30	0.05	6.2	(0.68, 0.53)

The value of using sampling weights in variable selection is tentatively revealed by the second sampling plan, where the sample correlation structure is systematically distorted. Clearly, the spurious correlation between covariates in the sampled units reduces the efficiency of the selection methods. This is reflected in the depressed PSRs and the inflated FDRs of the unweighted procedures. Incorporating sampling weights in the selection process helps to correct the biased result. In particular, noticeable improvements are observed for the BIC-based selection. In the most impressive case (i.e., Model 3 of Table 5.2), the pseudo-likelihood-based BIC substantially improves on the classic BIC by increasing the PSR from 0.65 to 0.89 while reducing the corresponding FDR from 0.62 to 0.50. Our observation echoes the rationale of weighting as the removal of bias due to informative sampling (Section 6.3, Fuller 2009).

Table 5.2
Selection for the design generating strong spurious correlations (2nd plan). Results are summarized in terms of positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR) and averaged model size (AMS); Prediction assessments for Models 1-2 are based on the testing residual sum of squares (RSS), while for Models 3-4 they are based on positive/negative prediction rate (PPR, NPR) with a benchmark 0.5.
Weights	Criterion	PSR	FDR	CSR	AMS	Prediction
Model 1
Ignored	GCV	0.83	0.23	0.17	4.6	1.09
	AIC	0.97	0.49	0.04	8.6	1.10
	BIC	0.83	0.23	0.17	4.6	1.09
Included	GCV	0.95	0.31	0.13	5.9	1.07
	AIC	0.99	0.65	0.00	12.5	1.12
	BIC	0.95	0.30	0.14	5.9	1.07
Model 2
Ignored	GCV	0.62	0.22	0.02	5.0	1.13
	AIC	0.88	0.45	0.01	10.3	1.14
	BIC	0.62	0.22	0.02	5.1	1.12
Included	GCV	0.72	0.28	0.01	6.5	1.10
	AIC	0.89	0.59	0.00	13.7	1.12
	BIC	0.72	0.27	0.01	6.5	1.10
Model 3
Ignored	GCV	0.87	0.62	0.00	7.3	(0.66, 0.44)
	AIC	0.88	0.63	0.00	7.6	(0.65, 0.45)
	BIC	0.65	0.62	0.00	4.5	(0.68, 0.42)
Included	GCV	0.97	0.74	0.00	11.9	(0.70, 0.46)
	AIC	0.97	0.75	0.00	12.4	(0.68, 0.46)
	BIC	0.89	0.50	0.00	5.6	(0.70, 0.44)
Model 4
Ignored	GCV	0.94	0.48	0.00	9.5	(0.62, 0.51)
	AIC	0.95	0.50	0.00	10.0	(0.62, 0.52)
	BIC	0.72	0.41	0.00	6.1	(0.64, 0.49)
Included	GCV	0.93	0.61	0.00	12.5	(0.64, 0.53)
	AIC	0.94	0.62	0.00	12.9	(0.64, 0.53)
	BIC	0.82	0.34	0.01	6.4	(0.67, 0.54)

5.3 Scenario 2: Model mis-specification

A well-known rationale for using sampling weights is that they provide protection against model mis-specification (Pfeffermann and Holmes 1985; Kott 1991): the inferences based on weighted estimates may remain valid for the surveyed population, even when the model fails. To gain further insight into weighting in variable selection, we compare the proposed BIC with the classic unweighted methods in a simulation where the presumed model differs from the model that generates the data. In such situations, a postulated "true" model does not exist, and the goal of variable selection is to find an optimal model that describes the finite population well. We still make use of the stratified pseudo-finite population in Section 5.2, but generate the response variable $Y$ according to the strata. Specifically, the values of $Y$ for units in strata (Male, 55+) and (Female, 55+) were generated by

$Y = 0.6 X_{6} + 0.4 X_{18} + 0.4 X_{20} + 0.6 X_{38} + ε,$

while the values of $Y$ for units in the strata (Male, 55-) and (Female, 55-) were generated by

$Y = 0.6 X_{6} + 0.4 X_{18} + 0.4 X_{20} + ε$

with $ε \sim N (0,1)$ denoting a random error. In other words, we assume that variable $X_{38}$ is influential only for people aged 55 and older, but not for people younger than 55. In addition, we further violate the presumed model 1 by excluding $X_{6}$ from the set of candidate covariates, which mimics the situation where one important design feature is omitted in the modeling. A stratified SRSWOR of size 500 or 1,000 is drawn using the first sampling plan in Section 5.2. The weighted and unweighted procedures are then tested for the variable selection based on the sampled units.
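The stratum-dependent generating mechanism can be sketched as follows. The age flag is passed separately because the SLCDC coding of $X_{6}$ is not reproduced here; that flag, the function name and the toy covariate values are all hypothetical.

```python
import random

def misspecified_response(x, is_55_plus, rng):
    """Generate Y with an X38 effect that is present only in the 55+ strata."""
    mean = 0.6 * x[6] + 0.4 * x[18] + 0.4 * x[20]
    if is_55_plus:
        mean += 0.6 * x[38]
    return mean + rng.gauss(0.0, 1.0)

# Identical seeds give identical errors, which isolates the X38 term.
rng_a, rng_b = random.Random(3), random.Random(3)
x = {6: 1.0, 18: 1.0, 20: 0.0, 38: 2.0}   # hypothetical covariate values
y_older = misspecified_response(x, True, rng_a)
y_younger = misspecified_response(x, False, rng_b)
```

A single linear model fitted to the pooled data cannot represent this interaction between age and $X_{38},$ which is the sense in which the presumed model is misspecified.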

We summarize the simulation results in Table 5.3 by estimating the selection rates of $X_{18},$ $X_{20},$ and $X_{38}$ based on 1,000 replications. As in the previous simulations, the averaged model size (AMS) and the testing RSS of the selected models (i.e., the averaged RSS based on testing data of size 200) are also included in the summary. From Table 5.3, we see that when the model assumption is violated, the pseudo-likelihood-based BIC still achieves relatively high prediction accuracy by suggesting relevant variables with high probability. In contrast, ignoring the survey weights leads to a relative loss of nearly 9% in the testing RSS because of the exclusion of $X_{38} .$ Increasing the sample size helps to improve the goodness of fit for the misspecified models, yet the improvement comes at the cost of including more variables.
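As a rough check of the 9% figure, taking the BIC rows of Table 5.3 (the choice of rows is our reading and is not stated explicitly), the relative loss in testing RSS from ignoring the weights is

$\frac{1.93 - 1.77}{1.77} \approx 0.090 \ (n = 500), \quad \frac{1.87 - 1.71}{1.71} \approx 0.094 \ (n = 1,000) .$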

Table 5.3
Selection frequency of influential variables in model mis-specified case; The averaged model size (AMS) and the testing residual sum of squares (RSS) are also reported.
Weights	Criterion	X₁₈	X₂₀	X₃₈	AMS	Testing RSS
n = 500
Ignored	GCV	0.78	0.95	0.56	5.9	1.93
	AIC	0.95	0.99	0.73	12.5	1.95
	BIC	0.83	0.97	0.60	6.6	1.93
Included	GCV	0.73	0.92	0.84	6.3	1.77
	AIC	0.91	0.99	0.85	12.5	1.79
	BIC	0.78	0.94	0.83	6.9	1.77
n = 1,000
Ignored	GCV	0.96	1.00	0.79	7.6	1.87
	AIC	0.99	1.00	0.87	13.1	1.88
	BIC	0.97	1.00	0.80	7.9	1.87
Included	GCV	0.93	1.00	0.94	7.6	1.71
	AIC	0.98	1.00	0.96	13.0	1.72
	BIC	0.94	1.00	0.94	7.7	1.71

5.4 Analysis of SLCDC data

To illustrate the application of the proposed BIC, we use it to identify health behaviors that affect the control of blood pressure using SLCDC 2009. The response variable is BMHX_02 from the working data obtained from SLCDC, which has 2 levels indicating whether or not the blood pressure of the respondent is under control, based on the latest measurement by a health professional. We treat the remaining 39 variables in the working data as candidate covariates, and our goal is to identify the influential covariates that are associated with blood-pressure control. We build a logistic regression of BMHX_02 on the candidate covariates and use PPL-BIC with the SCAD penalty to select the influential ones (weights are scaled by $k = 10^{- 3}$ ). As a preliminary step, each covariate is standardized such that its first and second weighted sample moments are zero and unity respectively. For comparison, the AIC and GCV are also used in the analysis.
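The preliminary standardization can be sketched as below, reading "first and second weighted sample moments" as the weighted mean and the weighted second moment after centering. The function name and the toy data are hypothetical.

```python
import math

def weighted_standardize(x, w):
    """Center and scale x so its weighted mean is 0 and weighted variance is 1."""
    total = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / total
    var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x)) / total
    sd = math.sqrt(var)
    return [(xi - mean) / sd for xi in x]

x = [1.0, 2.0, 4.0]   # hypothetical covariate values
w = [1.0, 2.0, 1.0]   # hypothetical survey weights
z = weighted_standardize(x, w)
```

Putting all covariates on a common weighted scale keeps the penalty from favoring variables merely because of their units.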

In Figure 5.1, we plot the values of each criterion against the degree of model sparsity. We see that the BIC selects a model with 11 covariates, while the GCV and AIC pick the same model with 24 covariates. When survey weights are ignored in the selection procedure, models with 7 and 21 covariates are suggested by the standard BIC and by GCV/AIC, respectively. The distinction between the weighted and unweighted selection results reflects the potential distortion in the correlation structure of the sampled units. Such a distinction may also be explained by model mis-specification for part of the SLCDC population (Lohr and Liu 1994). Given the potential bias of unweighted methods, the weighted selection results are more plausible in the analysis.

Figure 5.1 Selection criteria values based on candidate models

We further assess the selected models in terms of predictive accuracy as follows. First, we draw 500 independent sets of 5,868 bootstrap samples (with replacement) from the working data of SLCDC. For the $t^{th}$ bootstrap sample $d_{t},$ $t = 1, \dots,500,$ the survey weight $w_{i}$ for the $i^{th}$ unit is adjusted by ${\tilde{w}}_{t i} = v_{t i} w_{i}$ with $v_{t i}$ denoting the number of times that the $i^{th}$ unit is selected in $d_{t} .$ We then fit the selected models to each bootstrap sample (with weights accounted accordingly), and evaluate their weighted positive and negative prediction rates (WPPR, WNPR) by

$WPPR = \frac{\sum_{i \in d_{t}} w_{i} I ({\hat{y}}_{i} = 1, y_{i} = 1)}{\sum_{i \in d_{t}} w_{i} I (y_{i} = 1)}, WNPR = \frac{\sum_{i \in d_{t}} w_{i} I ({\hat{y}}_{i} = 0, y_{i} = 0)}{\sum_{i \in d_{t}} w_{i} I (y_{i} = 0)},$

where $y_{i}$ and ${\hat{y}}_{i}$ denote the $i^{th}$ response in BMHX_02 and its predicted value. We summarize the averaged WPPR and WNPR based on 500 bootstrap samples in Table 5.4 according to three different benchmark values (i.e., 0.25, 0.35, 0.45).
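The bootstrap weight adjustment and the weighted prediction rates can be sketched as follows. Function names and toy inputs are hypothetical. We assume that the sums over $i \in d_{t}$ count repeated draws, so that they equal sums over distinct units with the adjusted weights ${\tilde{w}}_{t i} .$

```python
import random
from collections import Counter

def bootstrap_weights(weights, rng):
    """Return adjusted weights v_i * w_i for one with-replacement resample."""
    n = len(weights)
    counts = Counter(rng.randrange(n) for _ in range(n))  # v_i for each unit
    return [counts[i] * weights[i] for i in range(n)]

def weighted_rates(y, y_hat, w):
    """Return the weighted positive and negative prediction rates (WPPR, WNPR)."""
    pos = sum(wi for yi, wi in zip(y, w) if yi == 1)
    neg = sum(wi for yi, wi in zip(y, w) if yi == 0)
    hit_pos = sum(wi for yi, yh, wi in zip(y, y_hat, w) if yi == 1 and yh == 1)
    hit_neg = sum(wi for yi, yh, wi in zip(y, y_hat, w) if yi == 0 and yh == 0)
    return hit_pos / pos, hit_neg / neg

y = [1, 1, 0, 0]            # hypothetical responses
y_hat = [1, 0, 0, 1]        # hypothetical predictions
w = [2.0, 1.0, 3.0, 1.0]    # hypothetical survey weights
wppr, wnpr = weighted_rates(y, y_hat, w)
w_tilde = bootstrap_weights(w, random.Random(7))
```

For a given bootstrap sample, passing `w_tilde` in place of `w` gives the rates for that sample; units that are not drawn receive weight zero and drop out of both sums.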

From Table 5.4, we observe that the models selected from the unweighted analysis have lower WPPR in general, which provides additional support for using survey weights in the selection procedure. Compared with GCV/AIC, the BIC selects a model with a slightly more conservative WPPR but a higher WNPR. Nevertheless, the difference is not significant. Notably, the BIC-selected model is much smaller than the GCV/AIC-selected one, which makes the relationship between the response BMHX_02 and the covariates easier to interpret.

Table 5.4
Prediction accuracy for selected models: (WPPR, WNPR) based on different benchmarks.
Weights	Criterion	≥0.25	≥0.35	≥0.45
Ignored	AIC/GCV	(0.646, 0.525)	(0.460, 0.688)	(0.299, 0.811)
Ignored	BIC	(0.649, 0.513)	(0.445, 0.705)	(0.265, 0.818)
Included	AIC/GCV	(0.645, 0.523)	(0.488, 0.682)	(0.338, 0.790)
Included	BIC	(0.654, 0.532)	(0.485, 0.706)	(0.322, 0.830)

To assess the stability of selection, we repeat the weighted selection procedure on the 500 bootstrap samples. In Table 5.5, we list the bootstrap selection rates for the seven most significant covariates according to their MLE in the original SLCDC working data. The corresponding coefficient estimates and standard errors based on the bootstrap samples are also included. From Table 5.5, we find that only four significant covariates (i.e., DHHX_AGE, GENXDHMH, INHX_06, HWTDBMI) are consistently selected by BIC, while the GCV/AIC tends to pick more unreliable ones in the model. The BIC-based selection result suggests that the control of blood pressure is strongly associated with age, body weight, mental health and the medication information. Our observation echoes many hypertension studies in the literature (see, e.g., Gelber, Gaziano, Manson, Buring and Sesso 2007; Yan, Liu, Matthews, Daviglus, Ferguson and Kiefe 2003).

Table 5.5
Bootstrap selection results for significant variables: (Estimated coefficient, Standard error, Selection rate).
Variable	GCV	AIC	BIC
GEO_ON	(0.14, 0.09, 0.86)	(0.16, 0.09, 0.92)	(0.09, 0.09, 0.58)
DHHX_AGE	(-0.29, 0.09, 1.0)	(-0.32, 0.09, 1.0)	(-0.27, 0.08, 1.0)
GENXDHMH	(-0.15, 0.05, 0.99)	(-0.15, 0.05, 0.99)	(-0.14, 0.06, 0.92)
SMHXDSLT	(0.11, 0.07, 0.76)	(0.12, 0.07, 0.84)	(0.08, 0.09, 0.47)
MOHXDBPM	(-0.08, 0.07, 0.67)	(-0.09, 0.06, 0.81)	(-0.05, 0.07, 0.35)
INHX_06	(0.18, 0.06, 0.97)	(0.18, 0.06, 0.99)	(0.18, 0.07, 0.91)
HWTDBMI	(0.14, 0.06, 0.95)	(0.14, 0.06, 0.97)	(0.13, 0.06, 0.91)
Ave. Model Size	23.1	27.8	10.3

Date modified: 2017-09-20
