Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 6. Counterbalancing sub-sampling

6.1  The devastating impact of data defect on effective sample size

A key finding from studying the data quality issue, which has surprised many, is how small our “big data” effectively are once we take the data defect into account. To see this mathematically, we can equate the mean-squared error (MSE) of $\bar{G}_W$ in (2.1) with the MSE of a simple random sampling estimator of size $n_{\text{eff}}$. This yields (see Meng (2018) for the derivation):

$$n_{\text{eff}} \approx \frac{f_W}{1 - f_W}\,\frac{1}{\mathrm{E}\big[\rho^2_{\tilde{R},G}\big]} \approx \frac{f_W}{1 - f_W}\,\frac{1}{\rho^2_{\tilde{R},G}}, \tag{6.1}$$

where $f_W = n_W/N$ and the expectation $\mathrm{E}$ is with respect to the conditional distribution of $\tilde{R}$ given $n_W$. It is worthwhile to note that this (conditional) distribution can involve all three types of probability discussed in Section 1.2 because the variations in $\tilde{R}$ can come from multiple sources.
For example, in typical opinion surveys, there will be (1) design probability in the sampling indicator, (2) divine probability in formulating the non-response mechanism, and (3) device probability for estimating the mechanism and the weights.

Expression (6.1) is the weighted version/extension of the expression given in Meng (2018) with equal weights, which reveals the devastating impact of a seemingly tiny ddc. Suppose our sample is 1% of the population, and it suffers from a half-percent ddc. Applying (6.1) (with equal weights) with $f_W = 0.01$ and $\rho_{\tilde{R},G} = 0.005$ yields $n_{\text{eff}} \approx 404$ regardless of the sample size $n_R$. In the case of the 2020 US presidential election, 1% of the voting population is about 1.55 million people, and hence the loss of sample size due to a half-percent ddc is about $1 - (404 / 1{,}550{,}000) > 99.97\%$.
Such seemingly impossible losses have been reported in both election studies (Meng, 2018) and COVID vaccination studies (Bradley et al., 2021). A most devastating consequence of such losses is the “big data paradox”: the larger the (apparent) data size, the surer we fool ourselves, because our false confidence (in both the technical and the literal sense) goes up with the erroneous data size, while the actual coverage probability of the incorrectly constructed confidence intervals becomes vanishingly small (Meng, 2018; Msaouel, 2022).
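
As a quick arithmetic check of the figures above, the following minimal Python sketch evaluates the right-most expression in (6.1) for the equal-weight case. The function name `effective_sample_size` and its argument names are illustrative choices, not notation from Meng (2018).

```python
def effective_sample_size(f: float, ddc: float) -> float:
    """Approximate n_eff via (6.1): f / (1 - f) * 1 / ddc**2."""
    if not 0.0 < f < 1.0:
        raise ValueError("sampling fraction f must lie strictly between 0 and 1")
    if ddc == 0.0:
        raise ValueError("ddc must be non-zero for (6.1) to be finite")
    return f / (1.0 - f) / ddc**2


# Inputs quoted in the text: a 1% sample with a half-percent ddc.
n_eff = effective_sample_size(f=0.01, ddc=0.005)

# About 1% of the 2020 US voting population, as quoted in the text.
n_apparent = 1_550_000
loss = 1.0 - n_eff / n_apparent

print(f"n_eff is roughly {n_eff:.0f}")
print(f"relative loss of sample size is roughly {loss:.4%}")
```

Note that `n_apparent` enters only the relative loss; the approximation in (6.1) depends on the sample only through the sampling fraction and the ddc.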

A positive implication of this revelation, however, is that we can trade a great deal of data quantity for data quality and still end up with statistically more accurate estimates. Of course, in order to reduce the bias, we will need some information about it. If we have reliable information on the value of the ddc, we can directly adjust for the corresponding bias in estimating the population average, for example by a Bayesian approach similar to that taken by Isakov and Kuriwaki (2020) in their scenario analysis. Furthermore, if we have sufficient information to construct reliable weights, we can use the weights to adjust for selection biases, as is commonly done. Nevertheless, even in such cases, it may still be useful to create a representative miniature of the population out of a biased sample for general purposes, which for example can eliminate many practitioners’ anxiety and potential mistakes arising from not knowing how to use the weights properly. Indeed, few really know how to deal with weights, because “Survey weighting is a mess” (Gelman, 2007).

However, creating a representative miniature out of a biased sample is in general a challenging task, especially because the ddc can (and will) vary with the variable of interest. Nevertheless, just as weighting is a popular tool despite being far from perfect, let us explore representative miniaturization and see how far we can push the idea. The following example is therefore purely for brainstorming purposes; it looks into a common but challenging scenario, where we have reasonable information or understanding on the direction of the bias, that is, the sign of the ddc, but rather vague information about its magnitude. A good example is the non-representativeness of election polls that arises because voters tend not to disclose their preferences when they plan to vote for a socially unpopular candidate; we therefore know the direction of the bias, but not much about its degree other than some rough guesses (e.g., a range of 10 percentage points).

6.2  Creating a less biased sub-sample

The basic idea is to use such partial information about the selection bias to design a biased sub-sampling scheme to counterbalance the bias in the original sample, such that the resulting sub-samples have a high likelihood to be less biased than the original sample from our target population. That is, we create a sub-sampling indicator $S_I$ such that, with high likelihood, the correlation between $S_I R_I$ and $G_I$ is reduced, compared to the original $\rho_{R,G}$, to such a degree that it will compensate for the loss of sample size and hence reduce the MSE of our estimator (e.g., the sample average).
We say with high likelihood, in its non-technical meaning, because without full information on the response/recording mechanism, we can never guarantee such a counterbalance sub-sampling (CBS) would always do better. However, with judicious execution, we can reduce the likelihood of making serious mistakes.

To illustrate, consider the case where $y$ is binary. Let $\Delta = r_1 - r_0$, where $r_y$ is the propensity of responding/reporting for individuals whose responses will take value $y$: $r_y = \Pr_I(R_I = 1 \mid y_I = y)$. If the sample is representative, then like $\rho_{R,G}$, $\Delta$ is miniaturized, meaning that it is on the order of $N^{-1/2}$. This is most clearly seen via the easily verifiable identity (see (4.1) of Meng, 2018)

$$\Delta = \frac{\mathrm{Cov}_I(y_I, R_I)}{p(1 - p)} = \rho_{R,y} \sqrt{\frac{f_R (1 - f_R)}{p (1 - p)}}, \tag{6.2}$$

where $p = \Pr_I(y_I = 1)$ and $f_R = \Pr_I(R_I = 1)$, which is the original sampling rate. A key ingredient of CBS is to determine $s_y = \Pr_I(S_I = 1 \mid y_I = y, R_I = 1)$ for $y = 0, 1$, that is, the sub-sampling probabilities of individuals who reported $y = 0$ and $y = 1$, respectively.
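
Identity (6.2) can be checked numerically. The sketch below is only an illustration: it treats $\Pr_I$ as exact population frequencies, and the inputs `p`, `r1` and `r0` are hypothetical values chosen for the check, not figures from the source.

```python
from math import sqrt


def delta_three_ways(p: float, r1: float, r0: float) -> tuple[float, float, float]:
    """Return Delta computed directly and via the two expressions in (6.2)."""
    f_R = p * r1 + (1.0 - p) * r0      # overall recording rate Pr(R = 1)
    cov = p * r1 - p * f_R             # Cov(y, R) = E[yR] - E[y]E[R] for binary y, R
    rho = cov / sqrt(p * (1.0 - p) * f_R * (1.0 - f_R))  # the ddc rho_{R,y}
    direct = r1 - r0
    via_cov = cov / (p * (1.0 - p))
    via_rho = rho * sqrt(f_R * (1.0 - f_R) / (p * (1.0 - p)))
    return direct, via_cov, via_rho


# Hypothetical inputs: half the population has y = 1, with 6% and 4% propensities.
direct, via_cov, via_rho = delta_three_ways(p=0.5, r1=0.06, r0=0.04)
print(direct, via_cov, via_rho)  # the three values agree up to rounding error
```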

To determine the beneficial choices, let $f_S = \Pr_I(S_I = 1 \mid R_I = 1)$ be the sub-sampling rate, and $\Delta_S = s_1 r_1 - s_0 r_0$. Then by applying (2.2) (with equal weights) and (6.2) to both the sample average and the sub-sample average, we see that the sub-sample average has smaller (actual) error in magnitude if and only if

$$\left(\frac{\Delta_S}{f_S f_R}\right)^2 < \left(\frac{\Delta}{f_R}\right)^2 \iff f_S^2 > \left(\frac{\Delta_S}{\Delta}\right)^2. \tag{6.3}$$

Writing $r = r_1/r_0$ and $s = s_1/s_0$, the right-hand inequality in (6.3) becomes

$$\left[s p^{*} + (1 - p^{*})\right]^2 > \left(\frac{rs - 1}{r - 1}\right)^2, \tag{6.4}$$

where $p^{*} = \Pr_I(y_I = 1 \mid R_I = 1)$ is observed in the original sample, which should remind us that $p^{*}$ may be rather different from the $p$ we seek, because of the biased $R$-mechanism.
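
To make conditions (6.3) and (6.4) concrete, the sketch below evaluates both, together with the errors of the sample and sub-sample averages, in an idealized setting where population frequencies equal the stated propensities exactly. The function name and all numerical inputs are hypothetical, chosen only for illustration.

```python
def cbs_errors_and_conditions(p, r1, r0, s1, s0):
    """Idealized errors of the sample and sub-sample averages, plus (6.3) and (6.4)."""
    f_R = p * r1 + (1 - p) * r0              # Pr(R = 1)
    p_star = p * r1 / f_R                    # Pr(y = 1 | R = 1), observable
    f_S = s1 * p_star + s0 * (1 - p_star)    # Pr(S = 1 | R = 1)
    delta = r1 - r0
    delta_S = s1 * r1 - s0 * r0
    err_sample = p_star - p                  # error of the original sample average
    err_sub = p * s1 * r1 / (f_S * f_R) - p  # Pr(y = 1 | S = 1, R = 1) - p
    r, s = r1 / r0, s1 / s0
    cond_63 = f_S**2 > (delta_S / delta) ** 2
    cond_64 = (s * p_star + (1 - p_star)) ** 2 > ((r * s - 1) / (r - 1)) ** 2
    return err_sample, err_sub, cond_63, cond_64


# Hypothetical case: r = 1.5, with a mild (s = 0.8) and a harsh (s = 0.3) correction.
mild = cbs_errors_and_conditions(p=0.5, r1=0.06, r0=0.04, s1=0.8, s0=1.0)
harsh = cbs_errors_and_conditions(p=0.5, r1=0.06, r0=0.04, s1=0.3, s0=1.0)
for label, (e0, e1, c63, c64) in [("s = 0.8", mild), ("s = 0.3", harsh)]:
    print(f"{label}: sample error {e0:+.4f}, sub-sample error {e1:+.4f}, "
          f"(6.3) holds: {c63}, (6.4) holds: {c64}")
```

With these inputs, the mild correction reduces the error from 0.10 to about 0.045, whereas the harsh one over-corrects to about -0.19; in each case (6.3) and (6.4) agree with the comparison of the two errors.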

An immediate choice to satisfy (6.4) is to set $s = r^{-1}$, which of course is typically unrealistic, because if we knew the value of $r$, the problem would be a lot simpler. To explore how much leeway we have in deviating from this ideal choice, let $\delta = r - 1$; we can then show that (6.4) is equivalent to

$$(s - 1)\left\{[1 + (1 + p^{*})\delta](s - 1) + 2\delta\right\} < 0. \tag{6.5}$$
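
The equivalence of (6.4) and (6.5) is stated without proof; one way to verify it is sketched below. The shorthand $u$, $A$ and $B$ is introduced here for convenience and is not notation from the source. The argument assumes $r > 0$ and $r \neq 1$, so that $\delta > -1$ and $\delta \neq 0$.

```latex
\begin{align*}
u &= s - 1, \qquad
A = s p^{*} + (1 - p^{*}) = 1 + p^{*} u, \qquad
B = \frac{rs - 1}{r - 1} = \frac{\delta + (1 + \delta) u}{\delta}
  = 1 + \frac{(1 + \delta) u}{\delta}, \\
A - B &= -\,\frac{u\,[1 + (1 - p^{*})\delta]}{\delta}, \qquad
A + B = \frac{[1 + (1 + p^{*})\delta]\,u + 2\delta}{\delta}, \\
A^{2} - B^{2} &= (A - B)(A + B)
  = -\,\frac{1 + (1 - p^{*})\delta}{\delta^{2}}\;
    u \left\{ [1 + (1 + p^{*})\delta]\,u + 2\delta \right\}.
\end{align*}
% Since delta > -1 and 0 <= p* <= 1, the factor 1 + (1 - p*) delta is strictly
% positive, and delta^2 > 0. Hence A^2 > B^2, which is (6.4), holds if and only if
% u { [1 + (1 + p*) delta] u + 2 delta } < 0, which is (6.5) with u = s - 1.
```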

Inequality (6.5) tells us precisely the permissible choices of $s$ that avoid over-correcting (in the magnitude of the resulting bias):

(i)  When $r > 1$, i.e., $\delta > 0$, we can take any $s$ such that

$$\frac{[1 - (1 - p^{*})\delta]_{+}}{1 + (1 + p^{*})\delta} \le s < 1; \tag{6.6}$$

(ii)  When $r < 1$, i.e., $\delta < 0$, we can take any $s$ such that

$$1 < s \le \frac{1 - (1 - p^{*})\delta}{[1 + (1 + p^{*})\delta]_{+}}. \tag{6.7}$$
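
The bounds in (6.6) and (6.7) can be wrapped in a small helper. This is a minimal sketch: the function name `permissible_s_interval` is hypothetical, the inputs below are illustrative, and an upper bound of infinity is used to represent the case where the positivized denominator in (6.7) is zero.

```python
import math


def permissible_s_interval(r: float, p_star: float) -> tuple[float, float]:
    """Return (lower, upper) bounds on s = s_1/s_0 from (6.6) if r > 1, (6.7) if r < 1.

    As written in the text, the bound equal to 1 is excluded and the other is included.
    """
    if r <= 0.0 or r == 1.0:
        raise ValueError("r must be positive and different from 1")
    if not 0.0 <= p_star <= 1.0:
        raise ValueError("p_star must lie in [0, 1]")
    delta = r - 1.0
    num = 1.0 - (1.0 - p_star) * delta   # numerator shared by (6.6) and (6.7)
    den = 1.0 + (1.0 + p_star) * delta   # denominator shared by (6.6) and (6.7)
    if delta > 0.0:
        # (6.6): positivize the numerator; den is positive whenever delta > 0.
        return max(num, 0.0) / den, 1.0
    # (6.7): positivize the denominator; a zero denominator means no upper limit.
    pos_den = max(den, 0.0)
    return 1.0, (num / pos_den if pos_den > 0.0 else math.inf)


# Hypothetical inputs for illustration.
print(permissible_s_interval(r=2.0, p_star=0.5))  # lower bound 0.2, upper bound 1
print(permissible_s_interval(r=0.5, p_star=0.5))  # lower bound 1, upper bound 5
print(permissible_s_interval(r=0.2, p_star=0.5))  # lower bound 1, no upper limit
```

In the first two calls the ideal choice $s = r^{-1}$ (0.5 and 2, respectively) lies inside the returned interval, as it should.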

This pair of results confirms a number of our intuitions, but also offers some qualifications that are not so obvious. Since we sub-sample to compensate for the bias in the original sample, $s$ and $r$ must stay on opposite sides of 1, i.e., $(s - 1)(r - 1) = (s - 1)\delta < 0$, as seen in (6.6)-(6.7). To prevent over-corrections, some limits are needed, but it is also possible that the initial bias is so bad that no sub-sampling scheme can make things worse, which is reflected by the positivizing function $[x]_{+} = \max\{x, 0\}$ in the two expressions above.
However, the expressions for the limits as well as for the thresholds to activate the positivizing functions are not so obvious. Nor is it obvious that these expressions depend on the unknown $p$ indirectly via the observed $p^{*}$, and hence only prior knowledge of $r$ is required for implementing or assessing CBS.

This observation suggests that it is possible to implement a beneficial CBS when we can borrow information from other surveys (or studies) where the response/recording behaviors are of a similar nature. For example, we may learn that a previous similar survey had $r = 1.5$ (e.g., those with $y = 1$ had a 6% chance of being recorded, and those with $y = 0$ had only a 4% chance). Taking into account the uncertainty in the similarity between the two surveys, we might feel comfortable placing (1.2, 1.8) as the plausible range for $r$ in the current study.
Suppose we observe $p^* = 0.6$. This means that the maximum, over the range $r \in (1.2, 1.8)$, of the lower bound on the permissible $s$ as given in (6.6) is

$$
\frac{[1 - (1 - 0.6)(r - 1)]_+}{1 + 1.6\,(r - 1)} \;=\; \frac{[1.4 - 0.4r]_+}{1.6r - 0.6} \;\le\; \frac{1.4 - 0.4 \times 1.2}{1.6 \times 1.2 - 0.6} \;=\; 0.7. \tag{6.8}
$$

Therefore, as long as we choose $s \in [0.7, 1)$, we are unlikely to over-correct. The price we pay for this robustness is that the resulting sub-sample is not of as high a quality as it could be, for example, when the underlying $r$ for the current study is indeed 1.5 (in expectation). Choosing any $s \in [0.7, 1)$ will not provide the full correction as provided by $s = 1/r = 0.67$; that is, the sub-sample average will still have a positive bias, but with a smaller MSE compared to the original sample average.
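The arithmetic behind (6.8) and the residual bias can be checked numerically with a minimal sketch. The function names are hypothetical. The first function hard-codes the $p^* = 0.6$ form of the bound exactly as it appears in (6.8). The second rests on an assumption not spelled out in this passage: that $s$ acts as the relative retention rate of $y = 1$ versus $y = 0$ records, so that sub-sampling multiplies the recorded odds $p^*/(1 - p^*)$ by $s$. It is a large-sample approximation that ignores sampling variability and says nothing about the MSE.

```python
def lower_bound_s(r: float) -> float:
    """Lower bound on s from (6.8), specialised to p* = 0.6."""
    return max(1.4 - 0.4 * r, 0.0) / (1.6 * r - 0.6)


def subsample_proportion(s: float, p_star: float = 0.6) -> float:
    """Approximate expected sub-sample proportion of y = 1.

    Assumption (not from the source): sub-sampling scales the recorded
    odds p*/(1 - p*) by the factor s.
    """
    return s * p_star / (s * p_star + (1.0 - p_star))


# Scan the plausible range r in [1.2, 1.8] on a fine grid.
r_grid = [1.2 + 0.6 * i / 600 for i in range(601)]
bounds = [lower_bound_s(r) for r in r_grid]
max_bound = max(bounds)
worst_r = r_grid[bounds.index(max_bound)]

# Under the same assumption, r = 1.5 and p* = 0.6 imply recorded odds of 1.5,
# hence population odds of 1.5 / 1.5 = 1, i.e. p = 0.5.
p_true = 0.5
full_correction = subsample_proportion(1 / 1.5)
partial_correction = subsample_proportion(0.7)

print(f"largest lower bound on s: {max_bound:.4f} at r = {worst_r:.2f}")
print(f"s = 1/r: {full_correction:.4f}, s = 0.7: {partial_correction:.4f}, p = {p_true}")
```

The bound decreases in $r$ over this range, so its maximum sits at the left end point $r = 1.2$, where it equals $0.92/1.32 \approx 0.697$, the value reported as 0.7 in (6.8). Under the stated assumption, $s = 1/r$ returns the sub-sample proportion to 0.5, whereas $s = 0.7$ leaves it slightly above 0.5 but well below the uncorrected 0.6.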
Of course, both the feasibility and effectiveness of such CBS need to be carefully investigated before it can be recommended for general consumption, especially going beyond binary $y$. The literature on inverse sampling (Hinkins, Oh and Scheuren, 1997; Rao, Scott and Benhin, 2003) is of great relevance for such investigations, because it also aims to produce simple random samples via subsampling, albeit with a different motivation (to turn complex surveys into simple ones for ease of analysis).

