5 Numerical studies

Chen Xu, Jiahua Chen and Harold Mantel


To evaluate the finite-sample performance of PPL-BIC, we conducted extensive numerical studies using data from the Survey on Living with Chronic Diseases in Canada (SLCDC; Statistics Canada 2009). In particular, we compare the proposed procedure with classic non-survey methods based on regression models postulated between SLCDC variables and hypothetical (simulated) responses. The studies offer some insight into pseudo-likelihood-based selection under two simulation scenarios. In the first scenario, populations are generated from presumed models and samples are obtained by designs that potentially create spurious correlations among SLCDC variables. In the second scenario, populations are not accurately generated from presumed models and samples are obtained by a design related to both the response and the candidate covariates. We also report an analysis of the original SLCDC 2009 data as an example of using PPL-BIC in a real application.

5.1 SLCDC data

SLCDC is a cross-sectional study sponsored by the Public Health Agency of Canada that collects information related to the experiences of Canadians with chronic health conditions. One of the main objectives of SLCDC is to identify health behavior that influences disease outcomes, so that the government can better plan and provide health services for people with chronic diseases.

SLCDC takes place every two years, with two chronic diseases covered in each survey cycle. The 2009 survey focused on arthritis and hypertension. We restrict our attention to hypertension. The target population for the hypertension survey is Canadians aged twenty years or older from the ten provinces who have been diagnosed with hypertension and who live in private dwellings. To facilitate the survey process, the sampling units of SLCDC 2009 are people with hypertension who completed the 2008 Canadian Community Health Survey (CCHS). For the purpose of SLCDC, the CCHS respondents are first stratified by sex and four age groups: 20-44, 45-64, 65-75, and 75+. The finite population formed by the CCHS respondents is thus divided into 8 categories, age (4 levels) by sex (2 levels). A stratified sampling plan is used for SLCDC with proportional sample size allocation. An overall sample of 9,005 was selected from the 17,437 CCHS respondents, and 6,142 respondents completed the SLCDC survey.

We identified 40 variables relevant to hypertension based on the original SLCDC data, among which 7 variables have complete information on all 6,142 respondents. The remaining 33 variables have some missing values due to item non-response in the original questionnaire (see Table A.1 in the Appendix for the list of variables and corresponding non-response rates). There was no obvious systematic reason for the item non-response. The variable with the most severe missingness is INCDRPR (household income), with a 9.6% non-response rate, while the amount of missing data is relatively minor for the remaining variables. To facilitate the analysis, we used simple imputation methods for the missing data as follows. For a categorical variable, we imputed the non-response value by a random value from the response set; for a continuous variable, we imputed the non-response value by the mean value of the responses. The two exceptions to the above imputation are the variables BMHX_02 and CNHX_05. The former acts as the response variable of the regression model in the later data analysis, while the latter has natural restrictions on the range of its value. For these two variables, we instead removed the 274 observations with missing values, which results in the basic working data with 5,868 observations. The imputation/removal procedure does not have any effect on evaluating the BIC procedure based on a simulated population. It could bias the analysis of the real data. Yet given the low rate of missingness, and the plausibility of missing at random in this specific case, the conclusion is unlikely to be severely affected.
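The imputation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing entries are encoded as `None`, the helper names `impute_categorical` and `impute_continuous` and the toy variables are hypothetical, and "a random value from the response set" is read here as a uniform draw from the observed responses of the same variable, which the text does not spell out.

```python
import random
from statistics import mean

def impute_categorical(values, rng):
    """Replace None by a random draw from the observed (non-missing) responses."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def impute_continuous(values):
    """Replace None by the mean of the observed (non-missing) responses."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [v if v is not None else fill for v in values]

rng = random.Random(2009)  # fixed seed so the sketch is reproducible

# Toy columns (hypothetical, not SLCDC values); None marks item non-response.
smoker = ["yes", "no", None, "no", None]
bmi = [24.0, None, 30.0, 27.0, None]

smoker_imputed = impute_categorical(smoker, rng)
bmi_imputed = impute_continuous(bmi)
print(smoker_imputed)
print(bmi_imputed)
```

Observed values are left untouched; only the `None` entries are filled, so the completed columns keep the length of the originals.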

Since the SLCDC is a follow-up to the CCHS, the sampling weights for SLCDC were initially obtained from the weights of the CCHS data. The weights were then adjusted to ensure that the SLCDC respondents represent the target population. Consequently, the adjusted weights show considerable variation between sampled units. After scaling by $k = n/N \approx 10^{-3}$, the adjusted weights vary from 0.01 to 33.62 with an inter-quartile range of 0.76.

5.2  Scenario 1: Spurious correlation

As mentioned, in complex survey designs, the correlation structure between variables reflected in the sample can be distorted from that of the population. In the first simulation scenario, we assess the proposed BIC method when data are collected through designs that potentially create spurious correlations between candidate covariates. Specifically, we treat the 40 identified variables as candidate covariates for some hypothetical response $Y$, and index them as $X_1$ to $X_{40}$ for simplicity. We consider both continuous and binary responses in our simulations. For the continuous cases, we generate the values of $Y$ according to

Model 1: $Y = 0.7X_6 + 0.7X_{10} + 0.6X_{18} - 0.6X_{22} + \varepsilon,$

Model 2: $Y = 0.7X_6 + 0.6X_{10} + 0.6X_{18} - 0.5X_{22} + 0.3X_{30} - 0.3X_{34} + \varepsilon,$

with $\varepsilon \sim N(0, 1)$. For the binary cases where $Y \in \{0, 1\}$, we generate the values of $Y$ according to the logistic models

Model 3: $\text{logit}\left(\Pr\{Y = 1 \mid \mathbf{X}\}\right) = 0.7X_7 - 0.6X_8 + 0.5X_{26},$

Model 4: $\text{logit}\left(\Pr\{Y = 1 \mid \mathbf{X}\}\right) = 0.8X_7 - 0.7X_8 + 0.6X_{26} - 0.5X_{28} + 0.4X_{36}.$

The specified models include one of the strata identifiers in SLCDC (i.e., $X_6$ or $X_7$) with a nested structure for each modeling context.
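As a rough illustration of how responses could be drawn from these models, the sketch below generates one continuous response from Model 1 and one binary response from Model 3. The coefficient dictionaries are copied from the models above; everything else (the helper names, the dictionary representation of a unit, and the all-ones covariate vector) is a hypothetical choice made for this example and is not taken from the authors' code.

```python
import math
import random

# Coefficients keyed by covariate index, copied from Models 1 and 3.
MODEL_1 = {6: 0.7, 10: 0.7, 18: 0.6, 22: -0.6}
MODEL_3 = {7: 0.7, 8: -0.6, 26: 0.5}

def linear_predictor(x, coefs):
    """x maps covariate index j to the value of X_j for one unit."""
    return sum(beta * x[j] for j, beta in coefs.items())

def draw_continuous(x, coefs, rng):
    """Y = x'beta + eps with eps ~ N(0, 1)."""
    return linear_predictor(x, coefs) + rng.gauss(0.0, 1.0)

def draw_binary(x, coefs, rng):
    """Y ~ Bernoulli(p) with logit(p) = x'beta."""
    p = 1.0 / (1.0 + math.exp(-linear_predictor(x, coefs)))
    return 1 if rng.random() < p else 0

rng = random.Random(1)

# Hypothetical unit: every covariate set to 1, purely for illustration.
unit = {j: 1.0 for j in range(1, 41)}

eta1 = linear_predictor(unit, MODEL_1)  # 0.7 + 0.7 + 0.6 - 0.6
eta3 = linear_predictor(unit, MODEL_3)  # 0.7 - 0.6 + 0.5
y1 = draw_continuous(unit, MODEL_1, rng)
y3 = draw_binary(unit, MODEL_3, rng)
print(eta1, eta3, y1, y3)
```

Models 2 and 4 would follow the same pattern with their own coefficient dictionaries.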

The finite population used in the simulation was created as follows. The basic working data of 5,868 respondents was duplicated 10 times proportional to the rounded integer values of SLCDC weights, which results in a pseudo-finite population of size 55,950 with complete information on $X_1, \ldots, X_{40}$. The values of response $Y$ were then generated based on Models 1-4 respectively. We consider the variable selection problem to be the identification of the postulated model that generates the values of $Y$.
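One possible reading of this replication step is sketched below: each record is copied `round(10 * w_i)` times, where `w_i` is its scaled weight. This interpretation, the function name `replicate_by_weight`, and the toy weights are assumptions made for illustration; the text does not give the exact rounding rule.

```python
def replicate_by_weight(records, weights, factor=10):
    """Return a pseudo-population in which record i appears round(factor * w_i) times.

    Assumption: records whose copy count rounds to zero are dropped. Python's
    round() resolves exact .5 ties to the nearest even integer.
    """
    population = []
    for record, w in zip(records, weights):
        copies = round(factor * w)
        population.extend([record] * copies)
    return population

records = ["a", "b", "c", "d"]             # stand-ins for respondents
scaled_weights = [0.01, 0.76, 1.32, 3.3]   # hypothetical scaled weights
pseudo_population = replicate_by_weight(records, scaled_weights)
print(len(pseudo_population))
```

Under this reading, units with very small weights may not appear in the pseudo-population at all, which is one reason the resulting size need not equal ten times the number of respondents.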

We investigate the performance of the proposed procedure under two stratified sampling plans. Specifically, we create 4 strata based on variables $X_6$ (age, 55-/55+) and $X_7$ (sex, Male/Female), which leads to the group (Female, 55-) of size 7,120, group (Female, 55+) of size 19,199, group (Male, 55-) of size 6,187, and group (Male, 55+) of size 23,458. In the first plan, a simple random sample without replacement (SRSWOR) with equally allocated sample size is drawn from each stratum. The inference is made based on the four SRSWORs pooled together. In the second plan, we further construct three subgroups within each stratum based on the sum of two binary covariates of the postulated models. 
In particular, the subgroups are built according to $X_{18} + X_{22}$ for data generated by Models 1-2, while the subgroups are similarly constructed based on $X_8 + X_{26}$ for data from Models 3-4. We then make inference based on SRSWORs drawn from each subgroup of the four strata. The overall sample size is equally allocated at the stratum level with a 2:1:2 proportion for the three subgroups within the same stratum. 
A simple Monte Carlo computation reveals that the sample correlation between $X_{18}$ and $X_{22}$ (for data from Models 1-2) can be as high as 0.5, whereas their population-based correlation is merely around 0.02. A similar phenomenon is also observed between $X_8$ and $X_{26}$ (for data from Models 3-4). We therefore expect variable selection under the second sampling plan to be more challenging due to this systematic inflation. 
In the simulations, we set the overall sample size $n = 500$ for Models 1-2 and $n = 1{,}500$ for Models 3-4. A summary of the variables that influence the response and the design variables affecting the sampling probabilities can be found in the Appendix (Table A.2).
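The mechanism behind the inflated correlation can be seen in a small synthetic experiment. The sketch below is not the authors' Monte Carlo computation: it uses a single artificial stratum with two independent binary covariates standing in for $X_{18}$ and $X_{22}$, and the helper names `pearson` and `srswor_by_group` are hypothetical. The subgroup sizes 50, 25 and 50 assume that $n = 500$ is split equally into 125 units per stratum and then exactly in a 2:1:2 proportion.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def srswor_by_group(units, group_of, allocation, rng):
    """Draw an SRSWOR of allocation[g] units from each group g.

    Returns (unit, design_weight) pairs with design weight N_g / n_g.
    """
    groups = {}
    for u in units:
        groups.setdefault(group_of(u), []).append(u)
    sample = []
    for g, n_g in allocation.items():
        members = groups[g]
        for u in rng.sample(members, n_g):  # sampling without replacement
            sample.append((u, len(members) / n_g))
    return sample

rng = random.Random(42)

# Synthetic stratum: two independent binary covariates (a, b).
# These are NOT the SLCDC values.
stratum = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(20000)]
pop_corr = pearson([a for a, _ in stratum], [b for _, b in stratum])

# Subgroups defined by a + b in {0, 1, 2}, sampled in a 2:1:2 proportion.
sample = srswor_by_group(stratum, lambda u: u[0] + u[1], {0: 50, 1: 25, 2: 50}, rng)
sample_corr = pearson([u[0] for u, _ in sample], [u[1] for u, _ in sample])
print(round(pop_corr, 3), round(sample_corr, 3))
```

Because the subgroups with sum 0 and sum 2 contain only concordant pairs and are oversampled relative to the discordant subgroup, the unweighted sample correlation is pushed well above the near-zero population value. The design weights returned alongside each unit are what a weighted analysis would use to undo this distortion.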

The PPL-BIC selection procedure was carried out on probability samples obtained from the finite population. In particular, we scaled the survey weights as mentioned in (3.2) and chose the SCAD penalty for the penalized pseudo-likelihood function (3.5). The corresponding maximizer of (3.5) was computed using the thresholding-based iterative algorithm (She 2011). For comparison, the ideas of AIC and GCV are also used as alternatives to the proposed BIC (3.4). Based on the discussion in Section 3, we define the pseudo-likelihood-based AIC and GCV as

$$\text{AIC}_n(s) = -2\,l_n(\hat{\boldsymbol{\beta}}_s) + 2\tau(s),$$

$$\text{GCV}_n(s) = -\frac{1}{n}\,\frac{l_n(\hat{\boldsymbol{\beta}}_s)}{\left(1 - \tau(s)/n\right)^2},$$

which are similarly implemented through the PPL-based procedure. Moreover, for each setup, we repeat the selection procedure with all survey weights ignored (set to unity). The unweighted selection results correspond to pure model-based inferences as discussed in Section 2. In particular, the pseudo-likelihood-based BIC reduces to the classic BIC (3.1) used for non-survey situations.
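The two criteria above are straightforward to evaluate once the pseudo log-likelihood at the fitted model is available. The sketch below assumes, for illustration only, that the pseudo log-likelihood for a binary response takes the weighted form $\sum_i w_i \log f(y_i; \hat{\pi}_i)$ with weights already scaled as in (3.2); the function names and the toy inputs are hypothetical. Setting every weight to one corresponds to the unweighted analysis.

```python
import math

def weighted_bernoulli_loglik(y, p_hat, w):
    """Pseudo log-likelihood sum_i w_i * log f(y_i; p_i) for binary responses.

    Assumption: the weights w_i are already scaled; w_i = 1 for all i gives
    the ordinary (unweighted) log-likelihood.
    """
    return sum(
        wi * (yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
        for yi, pi, wi in zip(y, p_hat, w)
    )

def aic_n(loglik, tau):
    """AIC_n(s) = -2 l_n(beta_hat_s) + 2 tau(s)."""
    return -2.0 * loglik + 2.0 * tau

def gcv_n(loglik, tau, n):
    """GCV_n(s) = -(1/n) l_n(beta_hat_s) / (1 - tau(s)/n)^2."""
    return -(loglik / n) / (1.0 - tau / n) ** 2

y = [1, 0, 1, 1]
p_hat = [0.8, 0.3, 0.6, 0.9]   # hypothetical fitted probabilities
w = [0.5, 1.5, 1.0, 1.0]       # hypothetical scaled weights
ll = weighted_bernoulli_loglik(y, p_hat, w)
print(aic_n(ll, tau=2), gcv_n(ll, tau=2, n=len(y)))
```

In a selection procedure, these values would be computed for each candidate model along the penalized solution path, and the model minimizing the chosen criterion would be retained.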

In Tables 5.1-5.2, we summarize the simulation results based on 1,000 repetitions in terms of the positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR), and averaged model size (AMS). Specifically, let $s_0$ be the true model that generates the finite population and $s'_j$ be the selected model based on the $j$th sample, $j = 1, \ldots, 1{,}000$. The PSR, FDR, CSR and AMS are estimated as

$$\text{PSR} = \frac{\sum_{j=1}^{1{,}000} \tau(s_0 \cap s'_j)}{1{,}000\,\tau(s_0)}, \quad \text{FDR} = \frac{\sum_{j=1}^{1{,}000} \tau(s'_j / s_0)}{1{,}000\,\tau(s'_j)},$$

$$\text{CSR} = \frac{\sum_{j=1}^{1{,}000} I(s'_j = s_0)}{1{,}000}, \quad \text{AMS} = \frac{\sum_{j=1}^{1{,}000} \tau(s'_j)}{1{,}000},$$

where $\tau(s)$ denotes the size of model $s$ and $I(\cdot)$ is the indicator function. In addition, we assess the predictive accuracy of the selected model as follows. For each setup, a test sample of size 200 is generated by SRSWOR from the same finite population as that for the training sample. For Models 1-2, we use the averaged residual sum of squares (RSS) on the test data as a measurement of the predictive ability of the selected model. For Models 3-4, we compute both positive and negative prediction rates. 
To be specific, let $\pi^*$ be a specified benchmark and $\hat{\pi}_i$ be the estimated success probability of the $i$th test sample, $i = 1, \ldots, 200$. We then predict the $i$th response $y_i$ by $\hat{y}_i = 1$ if $\hat{\pi}_i > \pi^*$ and $\hat{y}_i = 0$ otherwise. The correct prediction rates are estimated by

$$\text{PPR} = \frac{\sum_{i \in \{i:\, y_i = 1\}} I(\hat{y}_i = 1)}{\sum_{i=1}^{200} I(y_i = 1)}, \quad \text{NPR} = \frac{\sum_{i \in \{i:\, y_i = 0\}} I(\hat{y}_i = 0)}{\sum_{i=1}^{200} I(y_i = 0)}.$$

The final PPR and NPR are averaged based on 1,000 replications. Note that PPR and NPR here are similar to sensitivity and specificity in the clinical studies, which indicate the ability of a 0-1 prediction approach in terms of correct positive and negative predictions. In general, a larger $\pi^*$ leads to high NPR but low PPR. The value of $\pi^*$ should be cautiously specified in applications. In our simulation studies, we fix $\pi^* = 0.5$ for simplicity.
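The selection and prediction summaries defined above can be sketched as follows. Models are represented as Python sets of covariate indices, so that $\tau(s)$ is `len(s)`. Two points are assumptions of this sketch rather than statements from the text: the FDR is computed as the average over replications of the per-replication proportion of selected covariates that lie outside the true model, and an empty selection contributes zero to that average. The function names and the toy inputs are hypothetical.

```python
def selection_summaries(true_model, selected_models):
    """Return (PSR, FDR, CSR, AMS) over repeated selections."""
    reps = len(selected_models)
    psr = sum(len(true_model & s) for s in selected_models) / (reps * len(true_model))
    # Assumed reading of FDR: mean share of selected covariates not in the true model.
    fdr = sum(len(s - true_model) / len(s) if s else 0.0 for s in selected_models) / reps
    csr = sum(s == true_model for s in selected_models) / reps
    ams = sum(len(s) for s in selected_models) / reps
    return psr, fdr, csr, ams

def prediction_rates(y, pi_hat, benchmark=0.5):
    """Return (PPR, NPR): share of true 1s predicted as 1, and of true 0s predicted as 0."""
    y_hat = [1 if p > benchmark else 0 for p in pi_hat]
    pos = [yh for yi, yh in zip(y, y_hat) if yi == 1]
    neg = [yh for yi, yh in zip(y, y_hat) if yi == 0]
    ppr = sum(yh == 1 for yh in pos) / len(pos)
    npr = sum(yh == 0 for yh in neg) / len(neg)
    return ppr, npr

s0 = {6, 10, 18, 22}  # covariates of Model 1
# Hypothetical selections from four replications.
picks = [{6, 10, 18, 22}, {6, 10, 18}, {6, 10, 18, 22, 30}, {6, 10, 18, 22}]
psr, fdr, csr, ams = selection_summaries(s0, picks)

# Hypothetical test responses and estimated success probabilities.
ppr, npr = prediction_rates([1, 1, 0, 0, 1], [0.9, 0.4, 0.2, 0.6, 0.7])
print(psr, fdr, csr, ams, ppr, npr)
```

Raising `benchmark` makes a prediction of 1 harder to obtain, which is the trade-off between NPR and PPR noted above.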

The results are encouraging for the proposed BIC method. From Tables 5.1-5.2, we observe that models selected by AIC have both high PSR and high FDR, which indicates an excessive inclusion of irrelevant variables. In comparison, the BIC significantly reduces the FDR of the selected models with a slight sacrifice in PSR, and selects models with sizes closer to the truth. Although the GCV behaves similarly to BIC in the linear model settings, it concurs with AIC for the logistic models, where the binary responses provide less information.
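The four selection summaries reported in the tables can be sketched in Python as follows. The definitions used here are assumptions for illustration: PSR is the share of truly relevant covariates that are selected, FDR is the share of selected covariates that are irrelevant (both pooled over replications), CSR is the share of replications that recover the relevant set exactly, and AMS is the mean number of selected covariates. The function name `selection_summary` is hypothetical.

```python
def selection_summary(selected_sets, true_set):
    """Summarize replicated selections against a known relevant set.

    selected_sets: one collection of selected covariate indices per replication.
    true_set: indices of the covariates that are truly relevant.
    Assumption: PSR and FDR are pooled over replications.
    """
    true_set = set(true_set)
    hits = false_hits = n_selected = n_exact = 0
    for sel in selected_sets:
        sel = set(sel)
        hits += len(sel & true_set)        # relevant covariates selected
        false_hits += len(sel - true_set)  # irrelevant covariates selected
        n_selected += len(sel)
        n_exact += int(sel == true_set)    # exact recovery of the relevant set
    reps = len(selected_sets)
    return {
        "PSR": hits / (reps * len(true_set)),
        "FDR": false_hits / n_selected if n_selected else 0.0,
        "CSR": n_exact / reps,
        "AMS": n_selected / reps,
    }
```

Under these definitions, a criterion that selects many covariates tends to have a high PSR together with a high FDR and a large AMS, which is the pattern described for AIC.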

In the first sampling plan, the inclusion probabilities are related to $Y$ only through a single covariate in the model (i.e., $X_6$ or $X_7$). The sample correlation structure between the response and covariates is largely maintained from the finite population. Consequently, no substantial difference is observed between the weighted and unweighted selection procedures from Table 5.1.

Table 5.1
Selection for the design not generating strong spurious correlations (1st plan). Results are summarized in terms of positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR) and averaged model size (AMS); Prediction assessments for Models 1-2 are based on the testing residual sum of squares (RSS), while for Models 3-4 they are based on positive/negative prediction rate (PPR, NPR) with a benchmark 0.5.

Weights   Criterion  PSR   FDR   CSR   AMS   Prediction
Model 1
Ignored   GCV        0.96  0.19  0.28  4.9   1.04
Ignored   AIC        0.99  0.48  0.05  8.7   1.08
Ignored   BIC        0.96  0.19  0.28  4.9   1.04
Included  GCV        0.95  0.24  0.19  5.2   1.05
Included  AIC        0.99  0.61  0.01  11.4  1.11
Included  BIC        0.95  0.24  0.20  5.3   1.05
Model 2
Ignored   GCV        0.72  0.19  0.02  5.5   1.07
Ignored   AIC        0.89  0.44  0.01  10.3  1.09
Ignored   BIC        0.73  0.19  0.03  5.6   1.07
Included  GCV        0.74  0.24  0.02  6.1   1.08
Included  AIC        0.89  0.54  0.01  12.5  1.12
Included  BIC        0.74  0.24  0.03  6.1   1.08
Model 3
Ignored   GCV        0.99  0.59  0.00  7.8   (0.71, 0.45)
Ignored   AIC        0.99  0.62  0.00  8.4   (0.69, 0.49)
Ignored   BIC        0.96  0.43  0.00  5.1   (0.72, 0.44)
Included  GCV        0.99  0.67  0.00  9.9   (0.71, 0.47)
Included  AIC        0.99  0.70  0.00  10.7  (0.68, 0.48)
Included  BIC        0.94  0.45  0.00  5.3   (0.71, 0.45)
Model 4
Ignored   GCV        0.97  0.44  0.01  9.4   (0.66, 0.55)
Ignored   AIC        0.98  0.47  0.01  9.8   (0.65, 0.56)
Ignored   BIC        0.87  0.26  0.07  6.0   (0.69, 0.53)
Included  GCV        0.98  0.54  0.01  11.4  (0.66, 0.54)
Included  AIC        0.98  0.56  0.00  11.9  (0.66, 0.55)
Included  BIC        0.86  0.30  0.05  6.2   (0.68, 0.53)

The second sampling plan, in which the sample correlation structure is systematically distorted, gives some insight into the value of sampling weights in variable selection. Clearly, the spurious correlation between covariates in the sampled units reduces the efficiency of the selection methods, as reflected in the depressed PSRs and the inflated FDRs of the unweighted procedures. Incorporating sampling weights in the selection process helps to correct the biased results. In particular, noticeable improvements are observed for the BIC-based selection. In the most impressive case (i.e., Model 3 of Table 5.2), the pseudo-likelihood-based BIC substantially improves on the classic BIC, increasing the PSR from 0.65 to 0.89 while reducing the corresponding FDR from 0.62 to 0.50. Our observation echoes the rationale of weighting as the removal of bias due to informative sampling (Section 6.3, Fuller 2009).

Table 5.2
Selection for the design generating strong spurious correlations (2nd plan). Results are summarized in terms of positive selection rate (PSR), false discovery rate (FDR), correct selection rate (CSR) and averaged model size (AMS); Prediction assessments for Models 1-2 are based on the testing residual sum of squares (RSS), while for Models 3-4 they are based on positive/negative prediction rate (PPR, NPR) with a benchmark 0.5.

Weights   Criterion  PSR   FDR   CSR   AMS   Prediction
Model 1
Ignored   GCV        0.83  0.23  0.17  4.6   1.09
Ignored   AIC        0.97  0.49  0.04  8.6   1.10
Ignored   BIC        0.83  0.23  0.17  4.6   1.09
Included  GCV        0.95  0.31  0.13  5.9   1.07
Included  AIC        0.99  0.65  0.00  12.5  1.12
Included  BIC        0.95  0.30  0.14  5.9   1.07
Model 2
Ignored   GCV        0.62  0.22  0.02  5.0   1.13
Ignored   AIC        0.88  0.45  0.01  10.3  1.14
Ignored   BIC        0.62  0.22  0.02  5.1   1.12
Included  GCV        0.72  0.28  0.01  6.5   1.10
Included  AIC        0.89  0.59  0.00  13.7  1.12
Included  BIC        0.72  0.27  0.01  6.5   1.10
Model 3
Ignored   GCV        0.87  0.62  0.00  7.3   (0.66, 0.44)
Ignored   AIC        0.88  0.63  0.00  7.6   (0.65, 0.45)
Ignored   BIC        0.65  0.62  0.00  4.5   (0.68, 0.42)
Included  GCV        0.97  0.74  0.00  11.9  (0.70, 0.46)
Included  AIC        0.97  0.75  0.00  12.4  (0.68, 0.46)
Included  BIC        0.89  0.50  0.00  5.6   (0.70, 0.44)
Model 4
Ignored   GCV        0.94  0.48  0.00  9.5   (0.62, 0.51)
Ignored   AIC        0.95  0.50  0.00  10.0  (0.62, 0.52)
Ignored   BIC        0.72  0.41  0.00  6.1   (0.64, 0.49)
Included  GCV        0.93  0.61  0.00  12.5  (0.64, 0.53)
Included  AIC        0.94  0.62  0.00  12.9  (0.64, 0.53)
Included  BIC        0.82  0.34  0.01  6.4   (0.67, 0.54)

5.3 Scenario 2: Model mis-specification

A well-known rationale for using sampling weights is that they provide protection against model mis-specification (Pfeffermann and Holmes 1985; Kott 1991): the inferences based on weighted estimates may remain valid for the surveyed population, even when the model fails. To gain further insight into weighting in variable selection, we compare the proposed BIC with the classic unweighted methods in a simulation where the presumed model differs from the model that generates the data. In such situations, a postulated "true" model does not exist, and the goal of variable selection is to find an optimal model that describes the finite population well. We still make use of the stratified pseudo-finite population in Section 5.2, but generate the response variable $Y$ according to the strata. Specifically, the values of $Y$ for units in strata (Male, 55+) and (Female, 55+) were generated by

$$
Y = 0.6 X_6 + 0.4 X_{18} + 0.4 X_{20} + 0.6 X_{38} + \varepsilon,
$$

while the values of $Y$ for units in the strata (Male, 55-) and (Female, 55-) were generated by

$$
Y = 0.6 X_6 + 0.4 X_{18} + 0.4 X_{20} + \varepsilon
$$

with $\varepsilon \sim N(0, 1)$ denoting a random error. In other words, we assume that variable $X_{38}$ is influential only for people aged 55 and older, but not for people younger than 55. In addition, we further violate the presumed Model 1 by excluding $X_6$ from the set of candidate covariates, which mimics the situation where one important design feature is omitted in the modeling. A stratified SRSWOR of size 500 or 1,000 is drawn using the first sampling plan in Section 5.2. The weighted and unweighted procedures are then applied to variable selection based on the sampled units.
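The stratum-dependent response generation described above can be sketched in Python as follows. The function name `generate_response`, the boolean indicator `older` (true for units in the 55+ strata) and the practice of passing the errors in explicitly are assumptions made for illustration; only the coefficients come from the two generating equations.

```python
import numpy as np

def generate_response(x6, x18, x20, x38, older, eps):
    """Generate Y with an X38 effect only in the 55+ strata.

    older: boolean array, True for units in (Male, 55+) or (Female, 55+).
    eps:   random errors, e.g. independent N(0, 1) draws.
    """
    x6, x18, x20, x38, eps = (
        np.asarray(a, dtype=float) for a in (x6, x18, x20, x38, eps)
    )
    older = np.asarray(older, dtype=bool)
    # Part of the model shared by all strata.
    y = 0.6 * x6 + 0.4 * x18 + 0.4 * x20 + eps
    # X38 contributes 0.6 * X38 only for units aged 55 and older.
    return y + np.where(older, 0.6 * x38, 0.0)
```

For example, with `rng = np.random.default_rng(0)` the errors could be drawn as `rng.standard_normal(n)` for a population of `n` units. A single linear model fitted to the whole population cannot represent this stratum-specific effect, which is the source of the mis-specification.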

We summarize the simulation results in Table 5.3 by estimating the selection rates of $X_{18}$, $X_{20}$ and $X_{38}$ based on 1,000 replications. As in the previous simulations, the averaged model size (AMS) and the testing RSS of the selected models (i.e., the averaged RSS based on testing data of size 200) are also included in the summary. From Table 5.3, we see that when the model assumption is violated, the pseudo-likelihood-based BIC still achieves relatively high prediction accuracy by suggesting relevant variables with high probability. In contrast, ignoring the survey weights leads to a relative loss of nearly 9% in the testing RSS because of the more frequent exclusion of $X_{38}$. Increasing the sample size helps to improve the goodness of fit of the misspecified models, yet the improvement comes at the cost of including more variables.
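As a worked check of the relative loss quoted above, the BIC rows of Table 5.3 can be compared directly. This takes the weighted testing RSS as the reference, which is an assumption about how the relative loss is defined:

$$
\frac{1.93 - 1.77}{1.77} \approx 0.090 \quad (n = 500), \qquad
\frac{1.87 - 1.71}{1.71} \approx 0.094 \quad (n = 1{,}000).
$$

Both values are close to 9%.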

Table 5.3
Selection frequency of influential variables in the mis-specified model case. The averaged model size (AMS) and the testing residual sum of squares (RSS) are also reported.

Weights   Criterion  X18   X20   X38   AMS   Testing RSS
n = 500
Ignored   GCV        0.78  0.95  0.56  5.9   1.93
Ignored   AIC        0.95  0.99  0.73  12.5  1.95
Ignored   BIC        0.83  0.97  0.60  6.6   1.93
Included  GCV        0.73  0.92  0.84  6.3   1.77
Included  AIC        0.91  0.99  0.85  12.5  1.79
Included  BIC        0.78  0.94  0.83  6.9   1.77
n = 1,000
Ignored   GCV        0.96  1.00  0.79  7.6   1.87
Ignored   AIC        0.99  1.00  0.87  13.1  1.88
Ignored   BIC        0.97  1.00  0.80  7.9   1.87
Included  GCV        0.93  1.00  0.94  7.6   1.71
Included  AIC        0.98  1.00  0.96  13.0  1.72
Included  BIC        0.94  1.00  0.94  7.7   1.71

5.4 Analysis of SLCDC data

To illustrate the application of the proposed BIC, we use it to identify health behaviors that affect the control of blood pressure using the SLCDC 2009 data. The response variable is BMHX_02 from the working data obtained from SLCDC, which has two levels indicating whether or not the respondent's blood pressure is under control, based on the latest measurement by a health professional. We treat the remaining 39 variables in the working data as candidate covariates, and our goal is to identify the influential covariates that are associated with blood-pressure control. We build a logistic regression of BMHX_02 on the candidate covariates and use PPL-BIC with the SCAD penalty to select the influential ones (weights are scaled by $k = 10^{-3}$). As a preliminary step, each covariate is standardized such that the corresponding first and second weighted sample moments are zero and unity, respectively. For comparison, the AIC and GCV are also used in the analysis.
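The preliminary standardization step can be sketched in Python as below. The function name `weighted_standardize` is hypothetical, and the sketch assumes that "second weighted sample moment" refers to the weighted second moment after centering at the weighted mean.

```python
import numpy as np

def weighted_standardize(X, w):
    """Center and scale each column of X using survey weights w.

    After the transformation, each column has weighted mean zero and
    weighted second moment one (with weights normalized to sum to one).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                    # normalize the weights
    mean = w @ X                       # weighted first moment per column
    centered = X - mean
    scale = np.sqrt(w @ centered**2)   # weighted root second moment per column
    return centered / scale
```

Because the weights are normalized inside the function, multiplying all weights by a constant leaves the standardized covariates unchanged. A constant column would lead to a division by zero and would need to be dropped beforehand.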

In Figure 5.1, we plot the criterion scores against the degree of model sparsity. We see that the BIC selects a model with 11 covariates, while the GCV and AIC pick the same model with 24 covariates. When survey weights are ignored in the selection procedure, models with 7 and 21 covariates are suggested by the standard BIC and by GCV/AIC, respectively. The distinction between the weighted and unweighted selection results reflects the potential distortion in the correlation structure of the sampled units. Such a distinction may also be explained by model mis-specification for part of the SLCDC population (Lohr and Liu 1994). Given the potential bias of the unweighted methods, the weighted selection results are more plausible in the analysis.

Figure 5.1 Selection criteria values based on candidate models


We further assess the selected models in terms of predictive accuracy as follows. First, we draw 500 independent bootstrap samples, each of size 5,868 (with replacement), from the working data of SLCDC. For the $t$th bootstrap sample $d_t$, $t = 1, \ldots, 500$, the survey weight $w_i$ for the $i$th unit is adjusted by $\tilde{w}_{ti} = v_{ti} w_i$, with $v_{ti}$ denoting the number of times that the $i$th unit is selected in $d_t$. We then fit the selected models to each bootstrap sample (with weights accounted for accordingly), and evaluate their weighted positive and negative prediction rates (WPPR, WNPR) by

$$
\mathrm{WPPR} = \frac{\sum_{i \in d_t} w_i I(\hat{y}_i = 1, y_i = 1)}{\sum_{i \in d_t} w_i I(y_i = 1)}, \qquad
\mathrm{WNPR} = \frac{\sum_{i \in d_t} w_i I(\hat{y}_i = 0, y_i = 0)}{\sum_{i \in d_t} w_i I(y_i = 0)},
$$

where $y_i$ and $\hat{y}_i$ denote the $i$th response in BMHX_02 and its predicted value. We summarize the averaged WPPR and WNPR over the 500 bootstrap samples in Table 5.4 for three different benchmark values (i.e., 0.25, 0.35, 0.45).
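The bootstrap weight adjustment and the weighted prediction rates can be sketched in Python as follows. The function names `bootstrap_weights` and `weighted_rates` are hypothetical. The sketch also assumes that each bootstrap sample has as many draws as there are units in the working data, and that the caller supplies the weights, responses and predictions for the units being scored.

```python
import numpy as np

def bootstrap_weights(w, n_boot, rng):
    """Return an (n_boot, n) array of adjusted weights v[t, i] * w[i].

    v[t, i] is the number of times unit i is drawn in bootstrap sample t.
    Assumption: each bootstrap sample consists of n draws with replacement.
    """
    w = np.asarray(w, dtype=float)
    n = w.size
    idx = rng.integers(0, n, size=(n_boot, n))  # draws with replacement
    v = np.stack([np.bincount(row, minlength=n) for row in idx])
    return v * w

def weighted_rates(w, y, y_hat):
    """Return (WPPR, WNPR) for 0-1 responses y and predictions y_hat."""
    w, y, y_hat = (np.asarray(a) for a in (w, y, y_hat))
    wppr = np.sum(w * ((y_hat == 1) & (y == 1))) / np.sum(w * (y == 1))
    wnpr = np.sum(w * ((y_hat == 0) & (y == 0))) / np.sum(w * (y == 0))
    return wppr, wnpr
```

Units that are not drawn in a given bootstrap sample receive an adjusted weight of zero, so they drop out of any weighted fit based on that sample.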

From Table 5.4, we observe that the models selected from the unweighted analysis have lower WPPR in general, which provides additional support for using survey weights in the selection procedure. Compared with GCV/AIC, the BIC selects a model with a slightly more conservative WPPR but a higher WNPR. Nevertheless, the difference is not significant. Notably, the BIC-selected model is much smaller than the GCV/AIC-selected one, which makes the relationship between the response BMHX_02 and the covariates easier to interpret.

Table 5.4
Prediction accuracy for selected models: (WPPR, WNPR) based on different benchmarks.

Weights   Criterion  ≥0.25           ≥0.35           ≥0.45
Ignored   AIC/GCV    (0.646, 0.525)  (0.460, 0.688)  (0.299, 0.811)
Ignored   BIC        (0.649, 0.513)  (0.445, 0.705)  (0.265, 0.818)
Included  AIC/GCV    (0.645, 0.523)  (0.488, 0.682)  (0.338, 0.790)
Included  BIC        (0.654, 0.532)  (0.485, 0.706)  (0.322, 0.830)

To assess the stability of selection, we repeat the weighted selection procedure on the 500 bootstrap samples. In Table 5.5, we list the bootstrap selection rates for the seven most significant covariates according to their MLE in the original SLCDC working data. The corresponding coefficient estimates and standard errors based on the bootstrap samples are also included. From Table 5.5, we find that only four significant covariates (i.e., DHHX_AGE, GENXDHMH, INHX_06, HWTDBMI) are consistently selected by BIC, while the GCV/AIC tends to pick more unreliable ones in the model. The BIC-based selection result suggests that the control of blood pressure is strongly associated with age, body weight, mental health and the medication information. Our observation echoes many hypertension studies in the literature (see, e.g., Gelber, Gaziano, Manson, Buring and Sesso 2007; Yan, Liu, Matthews, Daviglus, Ferguson and Kiefe 2003).
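A summary of this kind could be computed from the bootstrap fits as sketched below in Python. The function name `bootstrap_selection_summary` is hypothetical. The sketch assumes that a covariate counts as selected when its penalized coefficient is nonzero, and that estimates and standard errors are the mean and standard deviation over all bootstrap fits; the source does not state how these quantities were computed.

```python
import numpy as np

def bootstrap_selection_summary(coefs):
    """Summarize penalized coefficient estimates across bootstrap fits.

    coefs: array of shape (n_boot, n_covariates); a zero entry means the
    covariate was not selected in that bootstrap fit (assumption).
    """
    coefs = np.asarray(coefs, dtype=float)
    selected = coefs != 0
    return {
        "estimate": coefs.mean(axis=0),            # mean over bootstrap fits
        "std_error": coefs.std(axis=0, ddof=1),    # bootstrap standard deviation
        "selection_rate": selected.mean(axis=0),   # share of fits selecting each covariate
        "avg_model_size": selected.sum(axis=1).mean(),
    }
```

A selection rate close to one indicates that a covariate is chosen in nearly every bootstrap sample, which is the sense in which the four covariates named above are consistently selected by BIC.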

Table 5.5
Bootstrap selection results for significant variables: (Estimated coefficient, Standard error, Selection rate).

Variable         GCV                  AIC                  BIC
GEO_ON           (0.14, 0.09, 0.86)   (0.16, 0.09, 0.92)   (0.09, 0.09, 0.58)
DHHX_AGE         (-0.29, 0.09, 1.0)   (-0.32, 0.09, 1.0)   (-0.27, 0.08, 1.0)
GENXDHMH         (-0.15, 0.05, 0.99)  (-0.15, 0.05, 0.99)  (-0.14, 0.06, 0.92)
SMHXDSLT         (0.11, 0.07, 0.76)   (0.12, 0.07, 0.84)   (0.08, 0.09, 0.47)
MOHXDBPM         (-0.08, 0.07, 0.67)  (-0.09, 0.06, 0.81)  (-0.05, 0.07, 0.35)
INHX_06          (0.18, 0.06, 0.97)   (0.18, 0.06, 0.99)   (0.18, 0.07, 0.91)
HWTDBMI          (0.14, 0.06, 0.95)   (0.14, 0.06, 0.97)   (0.13, 0.06, 0.91)
Ave. Model Size  23.1                 27.8                 10.3
