Integration of data from probability surveys and big found data for finite population inference using mass imputation
Section 2. Basic setup

2.1   Notation: Two data sources

Let $\mathcal{F}_N = \{(\mathbf{X}_i, Y_i) : i \in U\}$ with $U = \{1, \ldots, N\}$ denote a finite population, where $\mathbf{X}_i = (X_i^1, \ldots, X_i^p)$ is a $p$-dimensional vector of covariates, and $Y_i$ is the study variable. We assume that $\mathcal{F}_N$ is a random sample from a superpopulation model $\zeta$, and $N$ is known. Our objective is to estimate the general finite population parameter $\mu_g = N^{-1} \sum_{i=1}^{N} g(Y_i)$ for some known $g(\cdot)$. For example, if $g(Y) = Y$, $\mu_g = N^{-1} \sum_{i=1}^{N} Y_i$ is the population mean of $Y$. If $g(Y) = \mathbf{1}(Y < c)$ for some constant $c$, $\mu_g = N^{-1} \sum_{i=1}^{N} \mathbf{1}(Y_i < c)$ is the population proportion of $Y$ less than $c$.
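As a small illustration of the parameter $\mu_g$, the following sketch evaluates it for the two choices of $g$ mentioned above. The function name `mu_g`, the toy population `y_pop`, and the cutoff $c = 5$ are hypothetical and only serve the example; in practice the full population of $Y$ values is of course not observed.

```python
import numpy as np

def mu_g(y, g=lambda v: v):
    """Finite population parameter mu_g = N^{-1} * sum_i g(Y_i)."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(g(y)))

# Hypothetical toy population with N = 4 units.
y_pop = np.array([2.0, 4.0, 6.0, 8.0])

# g(Y) = Y gives the population mean of Y.
pop_mean = mu_g(y_pop)

# g(Y) = 1(Y < c) with c = 5 gives the population proportion of Y below c.
prop_below_5 = mu_g(y_pop, lambda v: v < 5.0)
```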

Suppose that there are two data sources, one from a probability sample, referred to as Sample A, and the other from a big data source, referred to as Sample B. Table 2.1 illustrates the observed data structure. Sample A contains observations $\mathcal{O}_A = \{(d_i = \pi_i^{-1}, \mathbf{X}_i) : i \in A\}$ with sample size $n = |A|$, where $\pi_i = P(i \in A)$ is known throughout Sample A, and Sample B contains observations $\mathcal{O}_B = \{(\mathbf{X}_i, Y_i) : i \in B\}$ with sample size $N_B = |B|$. Often the probability sample contains many other items, but we only use those items overlapping with our big data. Although the big data source has a large sample size, the sampling mechanism is often unknown, and we cannot compute the first-order inclusion probability for Horvitz-Thompson estimation. The naive estimators without adjusting for the sampling process are subject to selection biases. On the other hand, although the probability sample with sampling weights represents the finite population, it does not observe the study variable.
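The contrast between the two sources can be sketched numerically. The example below is a toy illustration under stated assumptions, not the authors' simulation: the population values, the relation $Y = 2X$, the hand-picked Sample B, and the use of simple random sampling for Sample A are all hypothetical. It shows that the unweighted Sample B mean of $Y$ can be far from the population mean, while the Horvitz-Thompson mean of $X$ from Sample A is design-unbiased (its average over all possible samples equals the population mean of $X$), even though Sample A carries no information on $Y$.

```python
from itertools import combinations

import numpy as np

# Hypothetical finite population of N = 8 units.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = 2.0 * x
N = len(x)

# Sample B: a hand-picked "big data" sample that over-represents large x.
# Its selection mechanism is treated as unknown, so no weights are available.
b_idx = np.array([3, 4, 5, 6, 7])
naive_b = float(y[b_idx].mean())   # unweighted mean of Y in Sample B
true_mean = float(y.mean())        # population mean of Y

# Sample A: simple random sampling without replacement of n = 4 units,
# so pi_i = n / N is known and d_i = 1 / pi_i. Only x is observed.
n = 4
pi = n / N
ht_means = [np.sum(x[list(s)] / pi) / N for s in combinations(range(N), n)]

# Averaging over all possible samples gives the design expectation.
ht_expectation = float(np.mean(ht_means))
```

Here `naive_b` exceeds `true_mean` because the selection into Sample B depends on $X$, which is what the mass imputation approach is meant to correct.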


Table 2.1
Two data sources. “✓” and “?” indicate observed and unobserved data, respectively

                                       Sample weight $d = \pi^{-1}$    Covariate $\mathbf{X}$    Study Variable $Y$
Probability Sample $\mathcal{O}_A$
    1                                  ✓                               ✓                         ?
    ⋮                                  ⋮                               ⋮                         ⋮
    $n$                                ✓                               ✓                         ?
Big Data Sample $\mathcal{O}_B$
    1                                  ?                               ✓                         ✓
    ⋮                                  ⋮                               ⋮                         ⋮
    $N_B$                              ?                               ✓                         ✓

2.2   Assumptions

Let $f(Y \mid \mathbf{X})$ be the conditional density function of $Y$ given $\mathbf{X}$ in the superpopulation model $\zeta$. Let $f(\mathbf{X})$ and $f(\mathbf{X} \mid \delta_B = 1)$ be the density function of $\mathbf{X}$ in the finite population and Sample B, respectively, where $\delta_B$ is the indicator of selection to Sample B. We first make the following assumptions.

Assumption 1 (Ignorability). Conditional on $\mathbf{X}$, the density of $Y$ in Sample B follows the superpopulation model; i.e., $f(Y \mid \mathbf{X}; \delta_B = 1) = f(Y \mid \mathbf{X})$.

Assumptions 1 and 2 constitute the strong ignorability condition (Rosenbaum and Rubin, 1983). This setup has previously been used by several authors; see, e.g., Rivers (2007) and Vavreck and Rivers (2008). Assumption 1 states the ignorability of the selection mechanism to Sample B conditional upon the covariates. Assumption 1 also implies that $P(\delta_B = 1 \mid \mathbf{X}, Y) = P(\delta_B = 1 \mid \mathbf{X})$. This assumption holds if the set of covariates contains all predictors for the outcome that affect the probability of being selected into Sample B. Under this assumption, the missing outcomes in Sample A are missing at random (Rubin, 1976).
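The implication of Assumption 1 for the selection probability follows from Bayes' rule. The short derivation below is a sketch, valid at points where $f(Y \mid \mathbf{X}) > 0$; the first equality is Bayes' rule applied conditionally on $\mathbf{X}$, and the second uses Assumption 1.

```latex
\begin{align*}
P(\delta_B = 1 \mid \mathbf{X}, Y)
  &= \frac{f(Y \mid \mathbf{X}; \delta_B = 1)\, P(\delta_B = 1 \mid \mathbf{X})}{f(Y \mid \mathbf{X})} \\
  &= P(\delta_B = 1 \mid \mathbf{X}).
\end{align*}
```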

Assumption 2 (Common support). The vector of covariates $\mathbf{X} \in \mathbb{R}^p$ has a compact and convex support, with its density bounded and bounded away from zero. There exist constants $C_l$ and $C_u$ such that $C_l \le f(\mathbf{X}) / f(\mathbf{X} \mid \delta_B = 1) \le C_u$ almost surely.
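The density ratio in Assumption 2 can be related to the selection probability through Bayes' rule. The sketch below assumes that $P(\delta_B = 1) > 0$ and that both densities exist; it shows that the upper bound $C_u$ is equivalent to the selection probability being bounded away from zero.

```latex
\begin{align*}
f(\mathbf{X} \mid \delta_B = 1)
  &= \frac{P(\delta_B = 1 \mid \mathbf{X})\, f(\mathbf{X})}{P(\delta_B = 1)}, \\
\frac{f(\mathbf{X})}{f(\mathbf{X} \mid \delta_B = 1)}
  = \frac{P(\delta_B = 1)}{P(\delta_B = 1 \mid \mathbf{X})} \le C_u
  &\;\Longleftrightarrow\;
  P(\delta_B = 1 \mid \mathbf{X}) \ge \frac{P(\delta_B = 1)}{C_u} > 0 .
\end{align*}
```

Because $P(\delta_B = 1 \mid \mathbf{X}) \le 1$, the ratio is never smaller than $P(\delta_B = 1)$, so the lower bound is satisfied with $C_l = P(\delta_B = 1)$.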

Assumption 2 implies that the support of $\mathbf{X}$ in Sample B is the same as that in the finite population. This assumption can also be formulated as a positivity assumption that $P(\delta_B = 1 \mid \mathbf{X}) > 0$ for all $\mathbf{X}$. Assumption 2 does not hold if certain units would never be included in the big data sample. The plausibility of this assumption can be judged by subject matter knowledge. For diagnostic purposes, we can examine the distribution of the estimated propensity scores or the distribution of the propensity score weights in Sample A. Values of the propensity score close to zero, or extremely large values of the propensity score weights, indicate a possible positivity violation. We assume all covariates are continuous. Categorical variables can be handled by first defining imputation classes using the partition of the categories and then estimating the average of the outcome using nearest neighbor imputation within imputation classes. In our context, Sample B is a big data sample, and therefore the number of donors in each imputation class can be reasonably large.
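The treatment of categorical variables through imputation classes can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `nn_mass_imputation`, its arguments, the Euclidean distance, and the form of the final estimator (design-weighted imputed values divided by the known $N$) are all hypothetical choices made for the example. For each unit in Sample A, the donor is the Sample B unit in the same class whose continuous covariates are closest.

```python
import numpy as np

def nn_mass_imputation(x_a, cls_a, d_a, x_b, cls_b, y_b, N, g=lambda v: v):
    """Nearest neighbor mass imputation within imputation classes.

    x_a, x_b : continuous covariates in Samples A and B
    cls_a, cls_b : imputation class labels built from categorical covariates
    d_a : design weights d_i = 1 / pi_i for Sample A
    y_b : study variable observed in Sample B
    N : known finite population size
    Returns the estimate of mu_g and the imputed values for Sample A.
    """
    cls_a, cls_b = np.asarray(cls_a), np.asarray(cls_b)
    x_a = np.asarray(x_a, dtype=float).reshape(len(cls_a), -1)
    x_b = np.asarray(x_b, dtype=float).reshape(len(cls_b), -1)
    d_a = np.asarray(d_a, dtype=float)
    y_b = np.asarray(y_b, dtype=float)

    y_imp = np.empty(len(cls_a))
    for i, c in enumerate(cls_a):
        # Candidate donors: Sample B units in the same imputation class.
        donors = np.flatnonzero(cls_b == c)
        if donors.size == 0:
            raise ValueError(f"no donors in Sample B for class {c!r}")
        # Euclidean distance on the continuous covariates (an assumption).
        dist = np.linalg.norm(x_b[donors] - x_a[i], axis=1)
        y_imp[i] = y_b[donors[np.argmin(dist)]]

    # Design-weighted mean of g over the imputed values (assumed form).
    estimate = float(np.sum(d_a * g(y_imp)) / N)
    return estimate, y_imp
```

A call such as `nn_mass_imputation(x_a, cls_a, d_a, x_b, cls_b, y_b, N, g=lambda v: v < c)` would target the population proportion below a cutoff `c` instead of the mean. The explicit error for an empty class reflects the positivity requirement: every class present in Sample A needs at least one donor in Sample B.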

