Model-assisted calibration of non-probability sample survey data using adaptive LASSO
Section 5. Application to National Health Interview Survey (NHIS)

5.1  NHIS and ACS data

We next apply LASSO calibration to the 2013 National Health Interview Survey (NHIS) to estimate the total number of adults (age 18 or older) in the population who have been diagnosed with cancer. The NHIS is a nationally representative sample of non-institutionalized civilian households, collected through multi-stage area-probability sampling (Centers for Disease Control and Prevention, 2005). Each month, health-related data on a cross-sectional sample of people in selected households are obtained through face-to-face interviews. The data provide pseudo-primary-sampling-unit (PSU) identifiers, pseudo-strata, and sampling weights, which allow weighted estimation that accounts for the complex survey design. In addition to health-related measures, NHIS also collects demographic data. Our goal is to assess our LASSO estimator by treating the unweighted NHIS sample as if it were a non-probability sample, and to explore how GREG and LASSO calibration compare with the design-weighted estimator.

To calibrate NHIS on a set of demographic and income-related variables, we use the American Community Survey (ACS) 2013 micro-data as the benchmark data. The ACS sample consists of households selected through multi-stage area-probability sampling from 3,143 counties of the U.S. The ACS is designed to improve small-area estimates between the decennial census long-form samples. Around three million households are selected each year, with measures collected on household types and on the demographics of individuals within the households. ACS also collects data from group quarters, which are excluded from this analysis. For ACS 2013, the adult sample size is 2,317,301. The NHIS 2013 sample size is 34,201 after removing 242 cases with missing values on demographic variables. For the purposes of this analysis, we treat the weighted estimates from the ACS as known population totals, a reasonable assumption given that the ACS adult sample is roughly 68 times the size of the NHIS sample.

5.2  Estimators

The outcome variable of interest is whether a person has been diagnosed with cancer. Define the binary indicator for the outcome variable:

$$ y_i = \begin{cases} 1, & \text{if person } i \text{ has been diagnosed with cancer} \\ 0, & \text{otherwise.} \end{cases} $$

We first use the NHIS 2013 sampling weights, $\mathbf{w}^{\text{NHIS}}$, and design variables to obtain an unbiased estimate of the population total, $T_y = \sum_{i=1}^{N} y_i$. We then treat the NHIS 2013 sample as if it had been collected by simple random sampling, with initial design weights $\mathbf{d}^A = N/n$, where $N$ is the population total obtained from ACS and $n$ is the sample size of NHIS. We calibrate $\mathbf{d}^A$ on a set of demographic and income variables with traditional GREG calibration and LASSO calibration. Finally, as a compromise between GREG and LASSO, we consider model-assisted calibration to a linear model for $y_i$ instead of the LASSO, using (2.7); note that, when $\hat{\mu}_i$ is computed using the same linear model as in GREG, the point estimates of the total will coincide, even though the calibration weights will differ. Thus, we generate seven estimates:

  1.   $\hat{T}_y^{\text{NHIS}} = \sum_{i \in s_A} w_i^{\text{NHIS}} y_i$: Estimate obtained with NHIS weights.
  2.   $\hat{T}_y^{\text{HTSRS}} = \sum_{i \in s_A} (N/n)\, y_i$: Estimate obtained with weights $\mathbf{d}^A = N/n$.
  3.   $\hat{T}_y^{\text{GREG1}} = \sum_{i \in s_A} w_i^{\text{GREG1}} y_i$: Estimate obtained by calibrating $\mathbf{d}^A$ with GREG using all calibration variables.
  4.   $\hat{t}_y^{\text{GREG1MC}} = \sum_{i \in s_A} w_i^{\text{GREG1MC}} y_i$: Estimate obtained by model-assisted calibration to a linear model using the predictors in GREG1.
  5.   $\hat{T}_y^{\text{GREG2}} = \sum_{i \in s_A} w_i^{\text{GREG2}} y_i$: Estimate obtained by calibrating $\mathbf{d}^A$ with GREG using only the calibration variables chosen by backward stepwise variable selection.
  6.   $\hat{t}_y^{\text{GREG2MC}} = \sum_{i \in s_A} w_i^{\text{GREG2MC}} y_i$: Estimate obtained by model-assisted calibration to a linear model using the predictors in GREG2.
  7.   $\hat{T}_y^{\text{LASSO}} = \sum_{i \in s_A} w_i^{\text{LASSO}} y_i$: Estimate obtained by model-assisted calibration with LASSO.
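The HTSRS and GREG estimators in the list above can be sketched on toy data as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard linear (chi-square distance) form of GREG calibration, and the function names `srs_weights` and `greg_weights`, the simulated sample, and the benchmark totals `T_x` are all hypothetical.

```python
import numpy as np

def srs_weights(N, n):
    """Equal initial weights d_i = N / n, as if the sample were a simple random sample."""
    return np.full(n, N / n)

def greg_weights(d, X, T_x):
    """Linear calibration of initial weights d so that the weighted sample
    totals of the columns of X match the benchmark totals T_x (assumed form)."""
    # Weighted cross-product matrix: sum_i d_i x_i x_i'
    M = X.T @ (d[:, None] * X)
    # Lagrange multipliers that close the gap between benchmark and weighted totals
    lam = np.linalg.solve(M, T_x - X.T @ d)
    return d * (1.0 + X @ lam)

rng = np.random.default_rng(0)
N, n = 10_000, 200                      # hypothetical population and sample sizes
x = rng.binomial(1, 0.6, size=n)        # binary auxiliary variable, over-represented in the sample
X = np.column_stack([np.ones(n), x])    # intercept column calibrates to the population size
y = rng.binomial(1, 0.05 + 0.10 * x)    # binary outcome related to x
T_x = np.array([N, 0.5 * N])            # hypothetical benchmark totals (population share of x is 50%)

d = srs_weights(N, n)
w = greg_weights(d, X, T_x)

t_htsrs = float(d @ y)                  # analogue of estimator 2
t_greg = float(w @ y)                   # analogue of estimators 3 and 5
```

Nothing in linear calibration constrains the weights to be positive, which is consistent with the negative minimum GREG weights reported in Table 5.2.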

The variance of $\hat{T}_y^{\text{NHIS}}$ is the linearization variance estimate of the total, accounting for the sampling strata, primary sampling units, and survey weights in the NHIS 2013 sample. The variances of HTSRS, GREG1, and GREG2 are linearization variance estimates with weights $\mathbf{d}^A$, $\mathbf{w}^{\text{GREG1}}$, and $\mathbf{w}^{\text{GREG2}}$, respectively. We obtain the variance of the LASSO estimator through a naive bootstrap.
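A naive bootstrap of the kind mentioned above can be sketched as follows. This is only an illustration under stated assumptions: units are resampled with replacement as if they were independent, and a simple expansion estimator stands in for the full LASSO calibration step, which is not reproduced here. The names `naive_bootstrap_se` and `expansion_total`, and the simulated data, are hypothetical.

```python
import numpy as np

def naive_bootstrap_se(y, estimator, n_boot=500, seed=0):
    """Standard error of `estimator` from resampling units with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n units with replacement
        reps[b] = estimator(y[idx])        # recompute the estimate on the replicate
    return float(reps.std(ddof=1))

N = 10_000                                              # hypothetical population size
y = np.random.default_rng(1).binomial(1, 0.1, size=400) # hypothetical binary outcome

def expansion_total(ys):
    """Stand-in estimator: N times the sample mean."""
    return N * ys.mean()

se_boot = naive_bootstrap_se(y, expansion_total)
# Analytic with-replacement standard error for the same estimator, for comparison
se_analytic = N * y.std(ddof=1) / np.sqrt(len(y))
```

For a calibration estimator, the callable passed as `estimator` would presumably need to repeat the model fitting and weight calibration on each replicate; whether the authors did so is not stated in this section.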

5.3  Working models

Table 5.1 lists the calibration variables used in this analysis, with their categories and distributions. The first column is the unweighted distribution of the variables in the NHIS sample. The second column contains the variable distributions in the NHIS sample, weighted by the $\mathbf{w}^{\text{NHIS}}$ person-level weights. The third column is the distribution of the variables in the population, obtained from the ACS benchmark data. Missing income is included as a separate category to capture the difference in missingness patterns between NHIS and ACS. Including a missing category also allows us to maintain the analytic sample size. Relative to ACS, the unweighted NHIS sample has higher proportions of females and of widowed/divorced/separated persons, and a lower proportion of non-Hispanic whites. After weighting, the NHIS distributions of gender and race are close to the benchmark's, and only the marital status categories show some differences.

We use an unweighted linear model with backward-stepwise variable selection to determine the working model for GREG2. The final variables included in the model for GREG2 are age, education, race, employment status (yes/no), and family income. For standard GREG and LASSO calibration, we use all available variables.
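Backward stepwise selection with an unweighted linear model can be sketched as below. The text does not state the stopping criterion, so AIC is used here purely as an assumption; the helper names `aic_ols` and `backward_stepwise` and the simulated data are hypothetical. The sketch also drops one column at a time, whereas a categorical calibration variable would normally be dropped as a whole group of indicator columns.

```python
import numpy as np

def aic_ols(X, y):
    """AIC (up to an additive constant) of an unweighted least-squares fit."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

def backward_stepwise(X, y, names):
    """Start from all predictors; repeatedly drop the one whose removal lowers AIC most."""
    keep = list(range(X.shape[1]))

    def design(cols):
        # An intercept is always retained
        return np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])

    best = aic_ols(design(keep), y)
    while keep:
        scores = [(aic_ols(design([k for k in keep if k != j]), y), j) for j in keep]
        cand, drop = min(scores)
        if cand >= best:        # no single removal improves AIC: stop
            break
        best = cand
        keep.remove(drop)
    return [names[j] for j in keep]

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # only the first column carries signal
selected = backward_stepwise(X, y, ["x0", "x1", "x2"])
```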

Table 5.1
Calibration variables

| Variable | Category | No weights | NHIS (person-level weights) | ACS (person-level weights) |
|---|---|---|---|---|
| Region | Northeast | 16% | 18% | 18% |
| | Midwest | 20% | 23% | 21% |
| | South | 37% | 37% | 37% |
| | West | 26% | 23% | 23% |
| Age | 18-29 | 19% | 21% | 21% |
| | 30-39 | 17% | 17% | 17% |
| | 40-49 | 16% | 18% | 18% |
| | 50-59 | 17% | 18% | 18% |
| | 60-69 | 15% | 14% | 14% |
| | 70-79 | 9% | 8% | 8% |
| | 80+ | 6% | 4% | 5% |
| Gender | Male | 45% | 48% | 48% |
| | Female | 55% | 52% | 52% |
| Education | Less than high school | 16% | 14% | 13% |
| | High school or less | 26% | 26% | 28% |
| | Some college | 20% | 20% | 23% |
| | College graduate | 29% | 30% | 25% |
| | Post-graduate | 10% | 10% | 10% |
| Race/Ethnicity | Non-Hispanic white | 60% | 66% | 66% |
| | Non-Hispanic black | 15% | 12% | 12% |
| | Hispanic | 17% | 15% | 15% |
| | Other | 8% | 7% | 7% |
| Marital Status | Married/partnered | 49% | 60% | 52% |
| | Widowed/divorced/separated | 27% | 18% | 20% |
| | Never married | 24% | 22% | 28% |
| Employed | Yes | 35% | 33% | 39% |
| | No | 65% | 67% | 61% |
| Income | 1st quartile | 22% | 15% | 19% |
| | 2nd quartile | 20% | 17% | 20% |
| | 3rd quartile | 21% | 22% | 20% |
| | 4th quartile | 21% | 28% | 19% |
| | Missing | 17% | 19% | 22% |

5.4  Results

Table 5.2 lists the estimates, standard errors (SE), root mean square errors (RMSE) computed by treating the design-weighted NHIS estimate as the true value, the percent deviation from the NHIS estimate, $\%\,\text{deviate} = 100\,(\hat{T} - \hat{T}_y^{\text{NHIS}}) / \hat{T}_y^{\text{NHIS}}$, and the standard deviation, minimum, and maximum of the weights associated with each estimator. We treat the NHIS estimate as the unbiased estimate because it is calculated with the probability-based sampling weights provided by NHIS. Without any weighting adjustment, HTSRS shows a positive bias of 5.9%. The GREG2 estimator reduces this bias from 5.9% to 2.0%, the GREG1 estimator reduces it to 1.8%, and the LASSO estimator reduces it to 0.9%. By definition, the model-assisted estimator using linear predictors yields the same point estimate as the corresponding GREG estimator; however, the variability is substantially reduced. In this analysis, if NHIS were a non-probability sample, then without weighting adjustment we would have over-counted the number of adults with cancer by 1.18 million. With traditional calibration, the error is reduced to an over-count of 365 thousand (without variable selection) or 392 thousand (with variable selection). LASSO calibration further reduces the over-count to 175 thousand.

Table 5.2
Results for estimating total number of individuals with cancer. % deviate is the difference to NHIS estimate divided by the NHIS estimate

| Estimator | $\hat{T}$ | SE | RMSE | % deviate from NHIS | SD (min; max) of weights |
|---|---|---|---|---|---|
| NHIS | 19,889,327 | 492,263 | 492,263 | 0.00% | 5,913 (168; 93,244) |
| HTSRS | 21,070,498 | 362,883 | 1,235,657 | 5.94% | 0 (6,866; 6,866) |
| GREG1 | 20,254,449 | 375,064 | 523,438 | 1.84% | 2,474 (-2,409; 16,679) |
| GREG1MC | 20,254,449 | 349,100 | 505,158 | 1.84% | 269 (6,181; 7,326) |
| GREG2 | 20,281,603 | 367,900 | 537,802 | 1.97% | 2,039 (-626; 13,947) |
| GREG2MC | 20,281,603 | 349,552 | 525,421 | 1.97% | 260 (6,215; 7,291) |
| LASSO | 20,064,671 | 347,586 | 389,309 | 0.88% | 323 (5,786; 7,168) |
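The % deviate column and the over-counts quoted in the text follow directly from the point estimates in Table 5.2. The short check below applies the % deviate formula to those values; the helper name `pct_deviate` is hypothetical.

```python
# Point estimates copied from Table 5.2
T_NHIS = 19_889_327
estimates = {
    "HTSRS": 21_070_498,
    "GREG1": 20_254_449,
    "GREG2": 20_281_603,
    "LASSO": 20_064_671,
}

def pct_deviate(t_hat, t_ref=T_NHIS):
    """Percent deviation of an estimate from the design-weighted NHIS estimate."""
    return 100.0 * (t_hat - t_ref) / t_ref

# (over-count in persons, % deviate rounded to two decimals) for each estimator
summary = {name: (t - T_NHIS, round(pct_deviate(t), 2)) for name, t in estimates.items()}

for name, (overcount, pct) in summary.items():
    print(f"{name}: over-count {overcount:,}, % deviate {pct:.2f}")
```

The over-counts of 1,181,171, 365,122, 392,276, and 175,344 correspond to the 1.18 million, 365 thousand, 392 thousand, and 175 thousand figures quoted in the text.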

As expected, the standard error of the NHIS estimate is the largest, as it properly incorporates the complex survey design. If the calibration working model correctly captures the relationship between the outcome variable and the calibration variables, we anticipate that the calibration estimators' standard errors will be smaller than the HTSRS estimator's. This is not the case for either of the GREG estimators, whose standard errors are larger than that of HTSRS, although their RMSEs are smaller because of the reduction in bias. In addition, the standard GREG estimator has a standard error about 2.0% greater than that of the backward-selection GREG estimator, a feature offset by its estimated 6.6% reduction in bias (although this is insufficient to reduce RMSE); use of the model-assisted GREG estimator does reduce the standard error and the root mean square error, by 5-7% and 2-3% respectively, relative to the standard GREG estimates. For LASSO calibration, we do observe a smaller standard error than that of HTSRS, even with a bootstrap variance estimate that tends to overestimate the variance. Without using the correct design weights, LASSO calibration produced the most accurate estimate of the population total while providing the smallest standard error among the estimators in this application. This is in spite of the fact that the LASSO calibration weights were only about one-seventh as variable as the GREG weights (as measured by their standard deviation), a feature reflected in the smaller standard error of the estimator itself and its greatly reduced RMSE.
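The RMSE column appears to combine each estimator's standard error with its deviation from the NHIS estimate as $\sqrt{\text{SE}^2 + (\hat{T} - \hat{T}_y^{\text{NHIS}})^2}$. That formula is not written out in this section; it is inferred here from the description of RMSE as treating the weighted NHIS estimate as the true value. The check below, with the hypothetical helper `rmse`, compares it against the reported figures.

```python
import math

T_NHIS = 19_889_327

# (point estimate, SE, reported RMSE) copied from Table 5.2
rows = {
    "NHIS":    (19_889_327, 492_263,   492_263),
    "HTSRS":   (21_070_498, 362_883, 1_235_657),
    "GREG1":   (20_254_449, 375_064,   523_438),
    "GREG1MC": (20_254_449, 349_100,   505_158),
    "GREG2":   (20_281_603, 367_900,   537_802),
    "GREG2MC": (20_281_603, 349_552,   525_421),
    "LASSO":   (20_064_671, 347_586,   389_309),
}

def rmse(t_hat, se, t_ref=T_NHIS):
    """Assumed form: square root of variance plus squared deviation from the NHIS estimate."""
    return math.hypot(se, t_hat - t_ref)

# Absolute gap between the assumed formula and each reported RMSE
gaps = {name: abs(rmse(t, se) - reported) for name, (t, se, reported) in rows.items()}

# Standard error of GREG1 relative to GREG2, as discussed in the text
se_ratio_greg1_vs_greg2 = rows["GREG1"][1] / rows["GREG2"][1]
```

Under this assumption every reported RMSE is matched to within one unit of rounding, and the GREG1 standard error is about 1.9% larger than the GREG2 standard error, in line with the "about 2.0%" stated above.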

