# 6. Suggested method

Sun Woong Kim, Steven G. Heeringa and Peter W. Solenberger

In this section, we present the details on an algorithm for achieving S1 and S2 of optimal solutions described in Section 4.

## 6.1 The algorithm

The algorithm has the following characteristics: 1) it finds a solution directly based on the values of the distance $\text{\hspace{0.17em}}{d}_{2}$ $\left({d}_{\infty }\right)$ between the controlled selection problem $A$ and each individual array ${B}_{k}$ in $\mathfrak{B}$ ; 2) it is computer-intensive, but easily implemented by LP; 3) it is applicable to any type of controlled selection problem with two-way stratification.

The algorithm has five steps. They are as follows:

Step 1. Find the set of all possible arrays, $\mathfrak{B}$ , satisfying (3.1) - (3.4) for a given controlled selection problem $A$ . Specifically, if there are any noninteger marginal expectations in $A$ , find all possible roundings of these marginal expectations by adjacent integers, which satisfy (3.3) and (3.4). Those rounded marginal integers will be $\left[{a}_{i.}\right]$ or $\left[{a}_{i.}\right]+1$ $\left(\left[{a}_{.j}\right]$ or $\left[{a}_{.j}\right]+1\right)$, while the integer marginal expectations will remain, since $\left[{a}_{i.}\right]={a}_{i.}$ ( $\left[{a}_{.j}\right]={a}_{.j}$ ). Next, find all possible arrays satisfying (3.1) and (3.2) under the rounded marginal integers and the other marginal integers.

Step 2. Choose either ${d}_{2}^{*}\left({B}_{k},A\right)$ or ${d}_{\infty }^{*}\left({B}_{k},A\right)$ (based on preference) and compute the chosen distance function for each ${B}_{k}\text{\hspace{0.17em}}\left(\in \mathfrak{B}\right),$ where:

Note that since each of the $ij$ cells in the problem array, $A$ , will receive a minimum allocation equal to $\text{ }\left[{a}_{ij}\right]$ with certainty the distance functions need only consider the non-integer part of ${a}_{ij}$ :

${a}_{ij}^{*}={a}_{ij}-\text{ }\text{ }\text{ }\left[{a}_{ij}\right],\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.3\right)$

and the integer difference (either 0 or 1) between the allocated sample size, ${b}_{ijk}$ , for solution $k=1,\dots ,L$ and the certainty count for the $ij\text{-th}$ cell of $A$ :

${b}_{ijk}^{*}={b}_{ijk}-\text{ }\text{ }\text{ }\left[{a}_{ij}\right].\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.4\right)$

Step 3. According to the distance function chosen in Step 2, construct the following LP problem consisting of the minimization of the objective function (6.5) or (6.6), which is a linear form, with the linear constraints (6.7) and (6.8):

Minimize

$O{F}_{1}=\sum _{{B}_{k}\in \mathfrak{B}}{d}_{2}^{*}\left({B}_{k},A\right)p\left({B}_{k}\right)\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.5\right)$

or

$O{F}_{2}=\sum _{{B}_{k}\in \mathfrak{B}}{d}_{\infty }^{*}\left({B}_{k},A\right)\text{\hspace{0.17em}}p\left({B}_{k}\right)\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.6\right)$

subject to

and

Step 4. By using an algorithm for LP, solve the LP problem established in Step 3 with respect to $L$ unknown variables

$\left\{p\left({B}_{k}\right),\text{\hspace{0.17em}}{B}_{k}\in \mathfrak{B}\right\}.\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.9\right)$

Step 5. Obtain the solution set $\left\{\left({B}_{k},p\left({B}_{k}\right)\right),\text{\hspace{0.17em}}\text{ }{B}_{k}\in {\mathfrak{B}}^{\prime }\right\}$ to $A$ consisting of arrays such that $p\left({B}_{k}\right)>0$ in the solution set to the LP problem obtained in Step 4.

Some remarks to be useful in implementing the algorithm are in order.

Remark 6.1. In Step 2, note that $\left[{a}_{ij}\right]$ in (6.3) or (6.4) indicates the number of units to be selected with certainty in each cell. Also, note that 

${d}_{2}^{*}\left({B}_{k},A\right)={d}_{2}\left({B}_{k},A\right)\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.10\right)$

and

${d}_{\infty }^{*}\left({B}_{k},A\right)={d}_{\infty }\left({B}_{k},A\right),\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.11\right)$

since ${b}_{ijk}^{*}-{a}_{ij}^{*}={b}_{ijk}-\text{ }{a}_{ij}$ due to (6.3) and (6.4).

Remark 6.2. In addition to the fact that ${d}_{2}^{*}$ is the natural concept of distance and $\text{\hspace{0.17em}}{d}_{\infty }^{*}$ is the simplest and easiest to compute under the norm, there is sensible advice on the choice of $\text{\hspace{0.17em}}{d}_{2}^{*}$ or $\text{\hspace{0.17em}}{d}_{\infty }^{*}$ in Step 2. Let ${D}_{2}$ and ${D}_{\infty }$ be the sets of the distance values for all possible arrays calculated by $\text{\hspace{0.17em}}{d}_{2}^{*}$ and $\text{\hspace{0.17em}}{d}_{\infty }^{*}$ , respectively. Let those arrays with the same distance value in ${D}_{2}$ ( ${D}_{\infty }$ ) be in the same group. Then logically, ${d}_{2}^{*}$ would cluster possible arrays into many different groups, where the number of groups is larger than in ${d}_{\infty }^{*}$ , due to (4.3) and (4.4). Accordingly, when using ${d}_{2}^{*}$ in LP problem, the number of arrays in $\mathfrak{B}$ such that $p\left({B}_{k}\right)>0$ would be larger than in using ${d}_{\infty }^{*}$ .

Remark 6.3. It is clear from (6.5) and (6.6) involving the distance values $\text{\hspace{0.17em}}{d}_{2}^{*}$ or $\text{\hspace{0.17em}}{d}_{\infty }^{*}$ that the solution in Step 5 results in the safe achievement of S1. Furthermore, S2 is achieved efficiently using linear constraints (6.7) and (6.8).

Remark 6.4. In constructing the LP problem in Step 3, the constraints for the cells with ${a}_{ij}^{*}=0$ can be omitted in (6.7). For example, for the $5×5$ controlled selection problem of Problem 2.4, the number of necessary constraints is 23, since two cells have ${a}_{ij}^{*}=0$ . Also, the linear constraint (3.6) is not essential, because it is implied in (6.7).

## 6.2 Using the simplex method

The LP problem constructed in Step 3 with the system of constraints of $RC$ equations in (6.7) for $L$ nonnegative unknowns in (6.8) is in the “standard form” and no transformation is required.

Supposing that $RC , the number of equations is smaller than the number of unknowns. Consequently it is an LP problem with a standard form, and it can always be solved by the simplex method by transforming with the system of $RC$ constraints in canonical form. To change the system into canonical form, one could arbitrarily choose $RC$ variables among $L$ variables as basic variables and then, using a pivot operation, attempt to put the system into canonical form, where each basic variable has coefficient one in one equation and zero in the others, and each equation has exactly one basic variable with coefficient one.

Letting the other $L-RC$ variables except $RC$ variables chosen as basic variables be 0 in the system in canonical form, the initial basic feasible solution is obtained. Next, by replacing exactly one basic variable, another basic feasible solution is obtained, and these steps are continued until the minimal value of the objective function is attained by any basic feasible solution. The set of these basic feasible solutions to the LP problem is convex. Many software packages for the simplex method are available for solving the LP problem. See Dantzig (1963) and Thie and Keough (2008, chapter 3) for the details on the simplex method.

## 6.3 The computational demands of the LP problem

It may be claimed that our algorithm is computationally expensive due to the following burdens:

1. Before solving the LP problem, all possible arrays to the controlled selection problem should be known.

2. The number of unknowns in the LP problem, $L$ , is equal to the number of all possible arrays, which becomes large as $RC$ , the number of cells in the controlled selection problem, increases. Hence, it is not unreasonable that $L$ may be as large as the binomial coefficient

3. If $RC$ is large, it also yields a large number of constraints in (6.7).

Sitter and Skinner (1994), and Tiwari and Nigam (1998) also referred to these potential disadvantages in describing their LP algorithms. However, due to the following reasons, these computational burdens stated in a, b, and c may not be prohibitive in actual operations.

First, finding all possible arrays manually might be difficult for any controlled selection problem with a large number of cells, but this task is greatly simplified using an efficient algorithm and the power of modern computers. Using the software described in the next section, they can be easily obtained in seconds even in comparatively large problems such as Problems 2.3 and 2.4.

Second, applying (6.12) to Problems 2.1 through 2.4, respectively yie1ds 84; 11,440; 10,626; and 4,457,400 arrays. However, the actual numbers for $L$ are only 6, 30, 141 and 159, respectively. This is because marginal expectations of both rows and columns are simultaneously matched and some cell expectations are zero. The actual numbers can also be obtained from the software described in the next section.

Third, although the greater $RC$ , the greater the number of constraints in the LP problem, the computational demands may depend on $L$ as well as $RC$ , and more specifically, on the number of basic feasible solutions, possibly denoted by

$S=\left(\begin{array}{c}L\\ RC\end{array}\right).\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.13\right)$

For example, if $L=$ 1,000 and $RC=$ 100, (6.13) gives 6.4E+139, which is an extremely large number. In this case, it is almost impossible to solve the LP problem, since each basic feasible solution should be investigated. But such cases would not happen in practice. According to Ross (2007, pages 221-224), when $RC , the number of necessary transitions, say $T,$ moving along the basic feasible solutions in solving the LP problem with standard form is approximately normally distributed with mean $E\left(T\right)={\mathrm{log}}_{e}\text{​}S$ and variance $Var\left(T\right)={\mathrm{log}}_{e}\text{​}S$ , where

${\mathrm{log}}_{e}\text{​}S\approx RC\left[1+{\mathrm{log}}_{e}\left\{\left(L/RC\right)-1\right\}\right].\text{ }\text{ }\text{ }\text{ }\text{ }\left(6.14\right)$

When applying this theory to the case of $L=$ 1,000 and $RC=$ 100, approximating both the mean and variance of $T$ by (6.14) becomes 320, and the 95% confidence interval (CI) of $T$ is (285, 355), which is smaller than the expected lower and upper limits.

Problem 2.1 Problem 2.2 Problem 2.3 Problem 2.4 6 30 141 159 9 14 13 23 NA 1.50E+08 7.90E+17 3.10E+27 NA 16 43 64 NA (8, 24) (30, 56) (48, 80) Note: NA- not available

Table 6.1 shows the results of the comparison between $S$ and $T$ for the four problems considered above. Note that due to Remark 6.4, $RC$ in (6.13) and (6.14) is replaced by $R{C}^{*}$ , that is, the number that results from subtracting the number of cells with ${a}_{ij}^{*}=0$ from $RC$ . The theory on $T$ is not applied to Problem 2.1 because $R{C}^{*}>L$ .

As shown in the table, the mean or confidence interval bounds of $T$ are considerably smaller than $S$ in each problem. In Section 8, $T$ in Table 6.1 will be compared with the actual number of transitions, say $t$ .

Is something not working? Is there information outdated? Can't find what you're looking for?