Integration of data from probability surveys and big found data for finite population inference using mass imputation
Section 2. Basic setup
2.1 Notation: Two data sources
Let $\mathcal{F}_N = \{(\boldsymbol{x}_i, y_i): i = 1, \ldots, N\}$ denote a finite population, where $\boldsymbol{x}_i$ is a $p$-dimensional vector of covariates, and $y_i$ is the study variable. We assume that $\mathcal{F}_N$ is a random sample from a superpopulation model $\zeta$, and $N$ is known. Our objective is to estimate the general finite population parameter $\theta_N = N^{-1} \sum_{i=1}^{N} g(y_i)$ for some known $g(\cdot)$. For example, if $g(y) = y$, $\theta_N$ is the population mean of $y$. If $g(y) = I(y < c)$ for some constant $c$, $\theta_N$ is the population proportion of $y$ less than $c$.
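As a concrete illustration, the parameter $\theta_N = N^{-1}\sum_{i=1}^{N} g(y_i)$ can be computed for both choices of $g(\cdot)$ above. The population values below are simulated and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population of size N (values are illustrative only).
N = 10_000
y = rng.normal(loc=2.0, scale=1.0, size=N)

def theta_N(y, g):
    """Finite population parameter theta_N = N^{-1} * sum_i g(y_i)."""
    return np.mean(g(y))

pop_mean = theta_N(y, g=lambda v: v)       # g(y) = y        -> population mean
c = 2.0
pop_prop = theta_N(y, g=lambda v: v < c)   # g(y) = I(y < c) -> proportion below c

print(pop_mean, pop_prop)
```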
Suppose that there are two data sources, one from a probability sample, referred to as Sample A, and the other from a big data source, referred to as Sample B. Table 2.1 illustrates the observed data structure. Sample A contains observations $\{(\boldsymbol{x}_i, d_i = \pi_i^{-1}): i \in A\}$ with sample size $n_A$, where the first-order inclusion probability $\pi_i$ is known throughout Sample A, and Sample B contains observations $\{(\boldsymbol{x}_i, y_i): i \in B\}$ with sample size $n_B$. Often the probability sample contains many other items, but we use only those items overlapping with our big data. Although the big data source has a large sample size, the sampling mechanism is often unknown, and we cannot compute the first-order inclusion probabilities required for Horvitz-Thompson estimation. Naive estimators that do not adjust for the sampling process are subject to selection bias. On the other hand, although the probability sample with sampling weights represents the finite population, the study variable is not observed in Sample A.
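The selection-bias point can be illustrated with a small simulation. All population and selection mechanisms below are hypothetical, and $y$ is used in Sample A only to benchmark the design-weighted estimator, even though it is unobserved there in our actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical superpopulation: x drives both y and selection into Sample B.
N = 100_000
x = rng.uniform(0, 2, size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)
theta = y.mean()                          # target: finite population mean

# Sample A: probability (Poisson) sample with known inclusion probabilities pi_i.
pi = np.clip(0.01 * (0.5 + x) / (0.5 + x).mean(), 0, 1)
in_A = rng.random(N) < pi
d = 1.0 / pi[in_A]                        # design weights d_i = 1 / pi_i
ht = np.sum(d * y[in_A]) / np.sum(d)      # Hajek (normalized Horvitz-Thompson)
# (y is shown for Sample A only for comparison; it is unobserved there in the paper's setup)

# Sample B: big data with unknown, x-dependent selection -> selection bias.
p_B = 1 / (1 + np.exp(-(x - 1.5) * 3))    # unknown to the analyst
in_B = rng.random(N) < p_B
naive = y[in_B].mean()                    # naive big-data mean, no adjustment

print(theta, ht, naive)
```

The design-weighted estimator tracks the population mean, while the naive big-data mean overshoots because units with large $x$ (and hence large $y$) are over-represented in Sample B.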
| | Unit | Sample weight $d$ | Covariate $\boldsymbol{x}$ | Study variable $y$ |
|---|---|---|---|---|
| Probability Sample A | $1$ | $d_1$ | $\boldsymbol{x}_1$ | ? |
| | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| | $n_A$ | $d_{n_A}$ | $\boldsymbol{x}_{n_A}$ | ? |
| Big Data Sample B | $1$ | ? | $\boldsymbol{x}_1$ | $y_1$ |
| | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| | $n_B$ | ? | $\boldsymbol{x}_{n_B}$ | $y_{n_B}$ |

Table 2.1. Sample A is a probability sample, and Sample B is a big data sample but may have selection biases.
2.2 Assumptions
Let $f(y \mid \boldsymbol{x})$ be the conditional density function of $y$ given $\boldsymbol{x}$ in the superpopulation model $\zeta$. Let $f(\boldsymbol{x})$ and $f(\boldsymbol{x} \mid \delta = 1)$ be the density function of $\boldsymbol{x}$ in the finite population and Sample B, respectively, where $\delta$ is the indicator of selection to Sample B. We first make the following assumptions.
Assumption 1 (Ignorability). Conditional on $\boldsymbol{x}$, the density of $y$ in Sample B follows the superpopulation model; i.e., $f(y \mid \boldsymbol{x}, \delta = 1) = f(y \mid \boldsymbol{x})$.
Assumptions 1 and 2 constitute the strong ignorability condition (Rosenbaum and Rubin, 1983). This setup has previously been used by several authors; see, e.g., Rivers (2007) and Vavreck and Rivers (2008). Assumption 1 states the ignorability of the selection mechanism into Sample B conditional on the covariates. Assumption 1 also implies that $E(y \mid \boldsymbol{x}, \delta = 1) = E(y \mid \boldsymbol{x})$. This assumption holds if the set of covariates contains all predictors of the outcome that affect the possibility of being selected into Sample B. Under this assumption, the missing outcomes in Sample A are missing at random (Rubin, 1976).
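The implication for conditional means follows in one line by writing the conditional expectation as an integral against the conditional density and applying Assumption 1:

```latex
E(y \mid \boldsymbol{x}, \delta = 1)
  = \int y \, f(y \mid \boldsymbol{x}, \delta = 1) \, dy
  = \int y \, f(y \mid \boldsymbol{x}) \, dy
  = E(y \mid \boldsymbol{x}).
```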
Assumption 2 (Common support). The vector of covariates $\boldsymbol{x}$ has a compact and convex support, with its density bounded and bounded away from zero. There exist constants $c_L$ and $c_U$ such that $c_L \leq f(\boldsymbol{x} \mid \delta = 1) / f(\boldsymbol{x}) \leq c_U$ almost surely.
Assumption 2 implies that the support of $\boldsymbol{x}$ in Sample B is the same as that in the finite population. This assumption can also be formulated as a positivity assumption that $P(\delta = 1 \mid \boldsymbol{x}) > 0$ for all $\boldsymbol{x}$. Assumption 2 does not hold if certain units would never be included in the big data sample. The plausibility of this assumption can be judged by subject-matter knowledge. For diagnostic purposes, we can examine the distribution of the estimated propensity scores or the distribution of the propensity score weights in Sample A. Values of the propensity score close to zero, or extremely large values of the propensity score weights, indicate a possible positivity violation. We assume all covariates are continuous. Categorical variables can be handled by first defining imputation classes using the partition of the categories and then estimating the average of the outcome using nearest neighbor imputation within imputation classes. In our context, Sample B is a big data sample, and therefore the number of donors for each imputation class can be reasonably large.
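A minimal sketch of this diagnostic, assuming the common practice of estimating the propensity of selection into Sample B with a logistic model fitted to the pooled samples. The data, the indicator coding, and the simple gradient-ascent fitter are all illustrative; a production analysis would incorporate the survey weights and use a proper solver:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pooled data: delta = 1 for Sample B units, 0 for Sample A units.
n_A, n_B = 500, 5_000
x_A = rng.normal(0.0, 1.0, size=(n_A, 1))
x_B = rng.normal(0.5, 1.0, size=(n_B, 1))
X = np.vstack([x_A, x_B])
delta = np.r_[np.zeros(n_A), np.ones(n_B)]

def fit_logistic(X, t, steps=200, lr=0.1):
    """Logistic regression by plain gradient ascent (illustrative only)."""
    Z = np.c_[np.ones(len(X)), X]          # add intercept column
    beta = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Z @ beta))
        beta += lr * Z.T @ (t - p) / len(t)
    return beta, Z

beta, Z = fit_logistic(X, delta)
p_hat = 1 / (1 + np.exp(-Z @ beta))        # estimated propensity scores

# Diagnostic: propensity score weights for the Sample A units.
w_A = 1 / p_hat[:n_A]
print("min propensity (A):", p_hat[:n_A].min())
print("max weight (A):", w_A.max())        # extreme values flag possible positivity violation
```

Propensity scores near zero in Sample A, or a heavy right tail in the weight distribution, suggest regions of the covariate space that Sample B rarely covers.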