2. Controlled selection problems

Sun Woong Kim, Steven G. Heeringa and Peter W. Solenberger

To select a sample of $n$ units, consider a two-way stratification design that classifies a population of $N$ units by two criteria with $R$ and $C$ categories, respectively. The controlled selection problem under two-way stratification is defined by the $R×C$ tabular array $A$, which consists of $RC$ cells holding nonnegative real numbers ${a}_{ij}$, called the cell expectations, that represent the expected number of units to be drawn in each cell $ij$. The standard two-way controlled selection problem is shown in Table 2.1.

Table 2.1
$R×C$ Controlled selection problem
$$
\begin{array}{c|cccccc|c}
\text{Category} & 1 & 2 & \cdots & j & \cdots & C & \text{Marginal expectation}\\
\hline
1 & {a}_{11} & {a}_{12} & \cdots & \cdots & \cdots & {a}_{1C} & {a}_{1.}\\
2 & {a}_{21} & {a}_{22} & \cdots & \cdots & \cdots & {a}_{2C} & {a}_{2.}\\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots\\
i & \cdots & \cdots & \cdots & {a}_{ij} & \cdots & \cdots & {a}_{i.}\\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots\\
R & {a}_{R1} & {a}_{R2} & \cdots & \cdots & \cdots & {a}_{RC} & {a}_{R.}\\
\hline
\text{Marginal expectation} & {a}_{.1} & {a}_{.2} & \cdots & {a}_{.j} & \cdots & {a}_{.C} & {a}_{..}\left(=n\right)
\end{array}
$$

The marginal expectations ${a}_{i.}$ and ${a}_{.j}$ denote the sums of the cell expectations in row category $i$ and column category $j$, respectively. Hence ${a}_{..}$ denotes the sum of all cell expectations and equals the total sample size $n$.

Although Table 2.1 takes a simple two-way tabular form, it should be noted that typically $n<RC$, and furthermore ${a}_{ij}$ can be very small (e.g., often less than $1$). In this case, deciding how to allocate $n$ units to cells, that is, how to obtain an $R×C$ array in which each ${a}_{ij}$ is rounded to a nonnegative integer, requires an algorithm.
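To see why the allocation is not trivial, the following minimal sketch rounds every cell independently to the nearest integer. The $3×3$ array is hypothetical (it is not one of the problems from the literature): every ${a}_{ij}$ equals $2/3$, so each marginal expectation is $2$ and $n=6<RC=9$.

```python
from fractions import Fraction

# Hypothetical 3x3 array of cell expectations (illustration only):
# every a_ij = 2/3, so each row and column margin is 2 and n = 6 < RC = 9.
A = [[Fraction(2, 3)] * 3 for _ in range(3)]

row_margins = [sum(row) for row in A]        # a_i.
col_margins = [sum(col) for col in zip(*A)]  # a_.j
n = sum(row_margins)                         # a_.. = n

# Naive approach: round each cell on its own, ignoring the margins.
rounded = [[round(a) for a in row] for row in A]
rounded_total = sum(sum(row) for row in rounded)

print("n =", n, "| units allocated by independent rounding =", rounded_total)
```

Independent rounding turns every cell into $1$ and allocates $9$ units instead of $6$, so the cells cannot be rounded one at a time without losing control of the margins and of $n$.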

A variety of controlled selection problems are used as examples in the literature. The first example of a controlled selection problem was the $17×4$ array, described by Goodman and Kish (1950, page 356), for allocating 17 PSUs to 68 cells given by 17 strata and 4 groups of North Central States in the United States. The array may be formed as follows. Let ${N}_{ij}$ denote the number of population elements in each cell $ij$ and let ${N}_{i.}$ denote the total number of population elements in each stratum. Then ${a}_{ij}={N}_{ij}/{N}_{i.}$, where some ${N}_{ij}$ are zero and $0\le {a}_{ij}<1$. All ${a}_{i.}$ equal the integer $1$, whereas the ${a}_{.j}$ are noninteger sums of the ${a}_{ij}$ in column $j$. The problem is therefore one of selecting one PSU per sample stratum ($i$ dimension) and simultaneously controlling the distribution to state groups ($j$ dimension). A total of $n=17$ PSUs will be selected.
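The construction ${a}_{ij}={N}_{ij}/{N}_{i.}$ can be sketched as follows. The population counts below are hypothetical (3 strata by 3 groups, not the Goodman and Kish data) and serve only to show that every row margin equals $1$ while the column margins are nonintegers.

```python
from fractions import Fraction

# Hypothetical population counts N_ij for 3 strata (rows) and 3 groups (columns).
N = [
    [30, 70, 0],
    [10, 0, 40],
    [25, 25, 50],
]

# a_ij = N_ij / N_i. : each cell's share of its stratum total.
A = [[Fraction(n_ij, sum(row)) for n_ij in row] for row in N]

row_margins = [sum(row) for row in A]        # all equal to 1
col_margins = [sum(col) for col in zip(*A)]  # noninteger in general
n = sum(row_margins)                         # one PSU per stratum

print(row_margins, col_margins, n)
```

With these counts the sample size is $n=3$, one unit per stratum, and the column margins are $3/4$, $19/20$ and $13/10$.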

The following paragraphs describe four additional problems found in the literature that will be used in the discussion and comparative evaluations presented in this paper.

Problem 2.1: Jessen (1970)

A $3×3$ problem involving two stratifying variables is given by Jessen (1970, page 779). Each cell $ij$ corresponds to one PSU and $N=9$. A sample of size $n=6$ is drawn. The cell expectations are ${a}_{ij}=n{X}_{ij}/X$, where ${X}_{ij}$ is a “measure of size” for the PSU in cell $ij$ and $X={\sum }_{i=1}^{R}{\sum }_{j=1}^{C}{X}_{ij}$. Note that in this problem, $0<{a}_{ij}<1$, and both ${a}_{i.}$ and ${a}_{.j}$ are equal to $2$.
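As an illustration of the formula ${a}_{ij}=n{X}_{ij}/X$, the sketch below uses hypothetical measures of size (not Jessen's data), chosen so that every row and column of sizes has the same total. Under that assumption the marginal expectations all equal $2$, as in this problem.

```python
from fractions import Fraction

# Hypothetical measures of size X_ij, one PSU per cell (not Jessen's data).
# Every row and column sums to 15, which makes all margins equal.
X_sizes = [
    [5, 4, 6],
    [6, 5, 4],
    [4, 6, 5],
]
n = 6
X = sum(sum(row) for row in X_sizes)  # grand total of the measures of size

# a_ij = n * X_ij / X
A = [[Fraction(n * x, X) for x in row] for row in X_sizes]

row_margins = [sum(row) for row in A]
col_margins = [sum(col) for col in zip(*A)]

print(row_margins, col_margins, sum(row_margins))
```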

Problem 2.2: Jessen (1978)

An extended $4×4$ version of Problem 2.1 comes from Jessen (1978, page 375). In this problem, $N=16$ and $n=8$ . As in Problem 2.1, both ${a}_{i.}$ and ${a}_{.j}$ are equal to $2$ , but $0\le {a}_{ij}\le 1$ .

Problem 2.3: Causey et al. (1985)

Causey et al. (1985, page 906) describe an $8×3$ two-way stratification problem designed to select 10 PSUs, that is, $n=10$. Let ${X}_{ij{q}_{ij}}$ $\left({q}_{ij}=1,\dots ,{r}_{ij}\right)$ be some measure of size for the PSU ${q}_{ij}$ in cell $ij$. Here ${a}_{ij}=n{X}_{ijq}/{X}_{q}$, where ${X}_{ijq}={\sum }_{{q}_{ij}=1}^{{r}_{ij}}{X}_{ij{q}_{ij}}$ is the total measure of size in cell $ij$ and ${X}_{q}={\sum }_{i=1}^{R}{\sum }_{j=1}^{C}{\sum }_{{q}_{ij}=1}^{{r}_{ij}}{X}_{ij{q}_{ij}}$ is the total over all cells. Note that in this problem, $0\le {a}_{ij}\le 2$, and most ${a}_{i.}$ and ${a}_{.j}$ are noninteger values.
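When a cell holds several PSUs, the measures of size are first summed within the cell. A minimal sketch with hypothetical data (a $2×2$ layout, not the $8×3$ array of Causey et al.) is given below; an empty list stands for a cell with no PSUs.

```python
from fractions import Fraction

# Hypothetical PSU measures of size per cell; [] means the cell has no PSUs.
sizes = [
    [[10, 20], [30]],
    [[], [25, 15]],
]
n = 2

cell_totals = [[sum(cell) for cell in row] for row in sizes]  # X_ijq
grand_total = sum(sum(row) for row in cell_totals)            # X_q

# a_ij = n * X_ijq / X_q
A = [[Fraction(n * x, grand_total) for x in row] for row in cell_totals]

row_margins = [sum(row) for row in A]
col_margins = [sum(col) for col in zip(*A)]

print(A, row_margins, col_margins)
```

Here the empty cell receives ${a}_{ij}=0$ and all four margins are nonintegers, while the cell expectations still sum to $n$.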

Problem 2.4: Winkler (2001)

Winkler (2001) provides the $5×5$ controlled selection problem with two stratifying variables shown in Table 2.2.

The objective in solving this problem is to select $n=37$ sample units from the population of $N=1{,}251$. The problem definition begins with a $5×5$ array of cell population sizes ${N}_{ij}$, where some ${N}_{ij}$ are quite small. The marginal row and column expectations, ${a}_{i.}$ and ${a}_{.j}$, are integer-valued and are predetermined using prior information on precision (e.g., coefficients of variation).

Table 2.2
$5×5$ Controlled selection problem
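$$
\begin{array}{c|ccccc|c}
\text{Category} & 1 & 2 & 3 & 4 & 5 & \text{Marginal expectation}\\
\hline
1 & 2.000 & 2.483 & 1.052 & 0.103 & 0.362 & 6\\
2 & 2.182 & 1.061 & 1.101 & 1.046 & 0.610 & 6\\
3 & 0.000 & 1.614 & 1.914 & 2.200 & 1.272 & 7\\
4 & 0.860 & 0.377 & 0.930 & 2.840 & 2.993 & 8\\
5 & 0.958 & 0.465 & 2.003 & 1.811 & 4.763 & 10\\
\hline
\text{Marginal expectation} & 6 & 6 & 7 & 8 & 10 & 37
\end{array}
$$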

The cell expectations, ${a}_{ij}$, are obtained by applying the generalized iterative fitting procedure (GIFP) of Dykstra (1985a, 1985b) and Winkler (1990) to the initial array. The GIFP is used to ensure that ${a}_{ij}<{N}_{ij}$ for the cells with small ${N}_{ij}$, when ${a}_{i.}$ and ${a}_{.j}$ are given. Note that in Table 2.2, the ${a}_{ij}$ are given to 3 decimal places, and $0\le {a}_{ij}<5$.
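The stated properties of Table 2.2 can be checked directly. The short sketch below uses the cell expectations and marginal expectations exactly as printed in the table; the variable names are illustrative, and a small tolerance is allowed because the values are decimal fractions stored as floating-point numbers.

```python
import math

# Cell expectations a_ij as printed in Table 2.2.
A = [
    [2.000, 2.483, 1.052, 0.103, 0.362],
    [2.182, 1.061, 1.101, 1.046, 0.610],
    [0.000, 1.614, 1.914, 2.200, 1.272],
    [0.860, 0.377, 0.930, 2.840, 2.993],
    [0.958, 0.465, 2.003, 1.811, 4.763],
]
row_targets = [6, 6, 7, 8, 10]  # a_i. from the table
col_targets = [6, 6, 7, 8, 10]  # a_.j from the table

row_sums = [sum(row) for row in A]
col_sums = [sum(col) for col in zip(*A)]

rows_ok = all(math.isclose(s, t, abs_tol=1e-9) for s, t in zip(row_sums, row_targets))
cols_ok = all(math.isclose(s, t, abs_tol=1e-9) for s, t in zip(col_sums, col_targets))
total_ok = math.isclose(sum(row_sums), 37, abs_tol=1e-9)
range_ok = all(0 <= a < 5 for row in A for a in row)

print(rows_ok, cols_ok, total_ok, range_ok)
```

All four checks hold for the printed values: the rows and columns reproduce the integer margins, the total is $n=37$, and every cell lies in $[0,5)$.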

The common characteristic shared by these controlled selection problems is that, as mentioned above, the total number of selected units is smaller than the number of cells (except for Problem 2.4, where $n=37>RC=25$) and many ${a}_{ij}$ are less than 1. The algorithms used to solve these problems must enforce the strict constraints described in the next section. As described in Section 4, the solution to a controlled selection problem obtained by any algorithm is a set of $R×C$ arrays together with a probability of selection for each array.
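The form of such a solution can be illustrated with a toy $2×2$ problem. Everything in the sketch is hypothetical, and it assumes, ahead of the formal constraints, that the probability-weighted average of the integer arrays should reproduce the cell expectations ${a}_{ij}$.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Hypothetical 2x2 problem: every a_ij = 1/2, every margin = 1, n = 2.
A = [[half, half],
     [half, half]]

# A candidate solution: integer arrays paired with selection probabilities.
solution = [
    (half, [[1, 0], [0, 1]]),
    (half, [[0, 1], [1, 0]]),
]

total_prob = sum(p for p, _ in solution)

# Probability-weighted average of the arrays, cell by cell.
expected = [
    [sum(p * B[i][j] for p, B in solution) for j in range(2)]
    for i in range(2)
]

print(total_prob == 1, expected == A)
```

One of the two arrays is drawn with its stated probability, and the cells equal to $1$ in the drawn array indicate where units are selected. Both arrays keep every row and column total at $1$, matching the marginal expectations.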
