# A comparison between nonparametric estimators for finite population distribution functions

## 2. Definition of the estimators

Let $\left({y}_{i}\mathrm{,}{x}_{i}\right)$ denote the values taken on by a study variable $Y$ and an auxiliary variable $X$ on unit $i$ of a finite population $U\mathrm{:=}\left\{\mathrm{1,2,}\dots \mathrm{,}N\right\}.$ Suppose that

$${y}_{i}\mathrm{=}m\left({x}_{i}\right)+{\epsilon}_{i}\mathrm{,}\text{\hspace{1em}}\text{\hspace{1em}}i\in U\mathrm{,}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(2.1)$$

where $m\left(x\right)$ is a smooth function and where the ${\epsilon}_{i}$'s are independent zero-mean random variables whose distribution functions $P\left({\epsilon}_{i}\le \epsilon \right)=G\left(\epsilon \,|\,{x}_{i}\right)$ depend smoothly on ${x}_{i}.$ Let $s\subset U$ be a sample chosen from the population $U$ according to some sample design. As usual in the context of complete auxiliary information, we assume that the ${x}_{i}$-values are known for all population units, while the ${y}_{i}$-values are observed only for the population units that belong to the sample $s.$

To estimate the unknown population distribution function

$${F}_{N}\left(t\right)\mathrm{:=}\frac{1}{N}{\displaystyle \sum _{i\in U}I\left({y}_{i}\le t\right)}\mathrm{,}$$

Kuo (1988) proposes the estimator given by

$$\widehat{F}\left(t\right)\mathrm{:=}\frac{1}{N}\left({\displaystyle \sum _{j\in s}I\left({y}_{j}\le t\right)}+{\displaystyle \sum _{i\notin s}}{\displaystyle \sum _{j\in s}{w}_{i\mathrm{,}j}I\left({y}_{j}\le t\right)}\right)\mathrm{,}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(2.2)$$

where in place of ${w}_{i,j}$ she suggests using either the local constant regression weights

$${w}_{i\mathrm{,}j}\text{\hspace{0.17em}}\mathrm{:=}\frac{{\displaystyle K\left(\frac{{x}_{i}-{x}_{j}}{\lambda}\right)}}{{\displaystyle \sum _{k\in s}K\left(\frac{{x}_{i}-{x}_{k}}{\lambda}\right)}}$$

with some (integrable) kernel function in place of $K\left(u\right)$ and $\lambda >0,$ or the $k$ nearest neighbor weights

$${w}_{i,j}:=\left\{\begin{array}{ll}1/k, & \text{if } {x}_{j} \text{ is one of the } k \text{ nearest neighbors to } {x}_{i},\\ 0, & \text{otherwise.}\end{array}\right.$$

Note that in the definition of $\widehat{F}\left(t\right),$

$${\widehat{G}}_{i}\left(t\right)\mathrm{:=}{\displaystyle \sum _{j\in s}{w}_{i\mathrm{,}j}I\left({y}_{j}\le t\right)}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(2.3)$$

is used as the fitted value in place of the unobserved indicator function $I\left({y}_{i}\le t\right)$ for $i\notin s.$
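As an illustration of (2.2) and (2.3), the following minimal sketch computes Kuo's estimator with local constant weights. The Gaussian kernel, the function names and the array layout (rows for target units $i$, columns for sample units $j$) are assumptions made for this example and are not prescribed by the source.

```python
import numpy as np

def gaussian_kernel(u):
    # Assumed kernel choice; any integrable kernel K(u) could be used instead.
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def local_constant_weights(x_target, x_sample, bandwidth):
    # w[i, j] = K((x_i - x_j) / lambda) / sum_k K((x_i - x_k) / lambda)
    u = (x_target[:, None] - x_sample[None, :]) / bandwidth
    k = gaussian_kernel(u)
    return k / k.sum(axis=1, keepdims=True)

def kuo_estimator(t, x_pop, y_sample, sample_idx, bandwidth):
    # x_pop: auxiliary values for all N units; y_sample: y-values of the
    # sampled units, in the same order as sample_idx.
    n_pop = x_pop.shape[0]
    in_sample = np.zeros(n_pop, dtype=bool)
    in_sample[sample_idx] = True
    ind = (y_sample <= t).astype(float)            # I(y_j <= t), j in s
    w = local_constant_weights(x_pop[~in_sample], x_pop[sample_idx], bandwidth)
    fitted = w @ ind                               # G_hat_i(t), i not in s
    return (ind.sum() + fitted.sum()) / n_pop
```

Because each row of the weight matrix sums to one, the estimate always lies in $[0,1]$ and equals one once $t$ exceeds the largest sampled $y$-value.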

Following an idea put forward in the textbook of Chambers and Clark (2012), we shall analyze an estimator for ${F}_{N}\left(t\right)$ based on alternative fitted values which incorporate a nonparametric estimate for the mean regression function $m\left(x\right).$ The fitted values in question are given by

$${\widehat{G}}_{i}^{*}\left(t\right)\mathrm{:=}{\displaystyle \sum _{j\in s}{w}_{i\mathrm{,}j}I\left({y}_{j}-{\widehat{m}}_{j}\le t-{\widehat{m}}_{i}\right)}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(2.4)$$

where

$${\widehat{m}}_{i}:={\displaystyle \sum _{k\in s}{w}_{i,k}{y}_{k}}$$

is a nonparametric estimator for $m\left(x\right)$ at $x\mathrm{=}{x}_{i},$ and the resulting estimator for ${F}_{N}\left(t\right)$ is given by

$${\widehat{F}}^{\mathrm{*}}\left(t\right)\mathrm{:=}\frac{1}{N}\left({\displaystyle \sum _{j\in s}I\left({y}_{j}\le t\right)}+{\displaystyle \sum _{i\notin s}}{\displaystyle \sum _{j\in s}{w}_{i\mathrm{,}j}I\left({y}_{j}-{\widehat{m}}_{j}\le t-{\widehat{m}}_{i}\right)}\right)\mathrm{.}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(2.5)$$
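The estimator in (2.5) can be sketched for arbitrary regression weights as follows. The function name and the convention of passing two precomputed weight matrices (`w_out` for units outside the sample, `w_in` for sampled units) are hypothetical choices for this example.

```python
import numpy as np

def modified_estimator(t, y_sample, w_out, w_in, n_pop):
    # w_out[i, j]: weight w_{i,j} for non-sampled unit i and sampled unit j
    # w_in[j, k]:  weight w_{j,k} for sampled units j and k
    m_out = w_out @ y_sample          # m_hat_i for i not in s
    m_in = w_in @ y_sample            # m_hat_j for j in s
    resid = y_sample - m_in           # y_j - m_hat_j
    # ind[i, j] = I(y_j - m_hat_j <= t - m_hat_i)
    ind = (resid[None, :] <= (t - m_out)[:, None]).astype(float)
    fitted = (w_out * ind).sum(axis=1)   # G_hat*_i(t) for i not in s
    return ((y_sample <= t).sum() + fitted.sum()) / n_pop
```

Unlike the fitted values in (2.3), the indicator matrix here depends on both $i$ and $j$, so the computation cannot be reduced to a single matrix-vector product of the weights with the observed indicators.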

The fitted values in (2.3) and (2.4), or appropriately modified versions of them which include sample inclusion probabilities in the regression weights ${w}_{i,j},$ can also be computed for $i\in s,$ and they can be employed for example in generalized difference estimators (Särndal et al. 1992, page 221) or in model calibrated estimators (see for example Wu and Sitter 2001; Chen and Wu 2002; Wu 2003; Montanari and Ranalli 2005; Rueda, Martínez, Martínez and Arcos 2007; Rueda, Sánchez-Borrego, Arcos and Martínez 2010). In addition to the model-based estimators in (2.2) and (2.5), we shall thus also consider the generalized difference estimators given by

$$\tilde{F}\left(t\right):=\frac{1}{N}\left({\displaystyle \sum _{i\in U}}{\displaystyle \sum _{j\in s}{\tilde{w}}_{i,j}I\left({y}_{j}\le t\right)}+{\displaystyle \sum _{i\in s}{\pi}_{i}^{-1}}\left(I\left({y}_{i}\le t\right)-{\displaystyle \sum _{j\in s}{\tilde{w}}_{i,j}I\left({y}_{j}\le t\right)}\right)\right)$$

and by

$${\tilde{F}}^{*}\left(t\right):=\frac{1}{N}\left({\displaystyle \sum _{i\in U}}{\displaystyle \sum _{j\in s}{\tilde{w}}_{i,j}I\left({y}_{j}-{\tilde{m}}_{j}\le t-{\tilde{m}}_{i}\right)}+{\displaystyle \sum _{i\in s}{\pi}_{i}^{-1}}\left(I\left({y}_{i}\le t\right)-{\displaystyle \sum _{j\in s}{\tilde{w}}_{i,j}I\left({y}_{j}-{\tilde{m}}_{j}\le t-{\tilde{m}}_{i}\right)}\right)\right)$$

where ${\pi}_{i}$ denotes the first-order sample inclusion probability of unit $i,$ ${\tilde{w}}_{i,j}$ denotes the design-weighted regression weights defined below, and ${\tilde{m}}_{i}:={\displaystyle {\sum}_{k\in s}}{\tilde{w}}_{i,k}{y}_{k}.$ Note that $\tilde{F}\left(t\right)$ and ${\tilde{F}}^{*}\left(t\right)$ are based on design-weighted counterparts of the fitted values ${\widehat{G}}_{i}\left(t\right)$ and ${\widehat{G}}_{i}^{*}\left(t\right),$ which are given by

$${\tilde{G}}_{i}\left(t\right)\mathrm{:=}{\displaystyle \sum _{j\in s}{\tilde{w}}_{i\mathrm{,}j}I\left({y}_{j}\le t\right)}$$

and

$${\tilde{G}}_{i}^{*}\left(t\right)\mathrm{:=}{\displaystyle \sum _{j\in s}{\tilde{w}}_{i\mathrm{,}j}I\left({y}_{j}-{\tilde{m}}_{j}\le t-{\tilde{m}}_{i}\right)}\mathrm{,}$$

respectively.
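A minimal sketch of the generalized difference estimator $\tilde{F}\left(t\right)$ built from the fitted values ${\tilde{G}}_{i}\left(t\right)$ is given below. It assumes that the factor $1/N$ applies to both the sum of fitted values over $U$ and the design-weighted sum of residuals over $s$, and it takes the weight matrix as a precomputed input; the function and argument names are hypothetical.

```python
import numpy as np

def generalized_difference_estimator(t, y_sample, w_tilde, sample_idx, pi_sample):
    # w_tilde[i, j]: design-weighted regression weight for population unit i
    #                (all i in U, one row each) and sampled unit j.
    # sample_idx:    population positions of the sampled units.
    # pi_sample:     first-order inclusion probabilities of the sampled units.
    n_pop = w_tilde.shape[0]
    ind = (y_sample <= t).astype(float)       # I(y_j <= t), j in s
    fitted = w_tilde @ ind                    # G_tilde_i(t) for every i in U
    resid = ind - fitted[sample_idx]          # residuals on the sample
    # Assumption: 1/N multiplies both the model term and the residual term.
    return (fitted.sum() + (resid / pi_sample).sum()) / n_pop
```

With all weights set to zero the expression reduces to the Horvitz-Thompson type estimator $N^{-1}\sum_{i\in s}\pi_i^{-1}I\left(y_i\le t\right)$, and in a census (every $\pi_i=1$) it returns $F_N\left(t\right)$ whatever the weights are.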

As for the regression weights ${w}_{i\mathrm{,}j}$ and ${\tilde{w}}_{i\mathrm{,}j},$ in the present work we consider local linear regression weights in their place. In what follows ${w}_{i\mathrm{,}j}$ and ${\tilde{w}}_{i\mathrm{,}j}$ are thus defined by

$${w}_{i\mathrm{,}j}\mathrm{:=}\frac{1}{n\lambda}K\left(\frac{{x}_{i}-{x}_{j}}{\lambda}\right)\frac{{M}_{\mathrm{2,}s}\left({x}_{i}\right)-\left(\frac{{x}_{i}-{x}_{j}}{\lambda}\right){M}_{\mathrm{1,}s}\left({x}_{i}\right)}{{M}_{\mathrm{2,}s}\left({x}_{i}\right){M}_{\mathrm{0,}s}\left({x}_{i}\right)-{M}_{\mathrm{1,}s}^{2}\left({x}_{i}\right)}$$

and

$${\tilde{w}}_{i\mathrm{,}j}\mathrm{:=}\frac{1}{{\pi}_{j}n\lambda}K\left(\frac{{x}_{i}-{x}_{j}}{\lambda}\right)\frac{{\tilde{M}}_{\mathrm{2,}s}\left({x}_{i}\right)-\left(\frac{{x}_{i}-{x}_{j}}{\lambda}\right){\tilde{M}}_{\mathrm{1,}s}\left({x}_{i}\right)}{{\tilde{M}}_{\mathrm{2,}s}\left({x}_{i}\right){\tilde{M}}_{\mathrm{0,}s}\left({x}_{i}\right)-{\tilde{M}}_{\mathrm{1,}s}^{2}\left({x}_{i}\right)}\mathrm{,}$$

where $n$ is the number of units in the sample $s,$

$${M}_{r\mathrm{,}s}\left(x\right)\mathrm{:=}{\displaystyle \sum _{k\in s}}\frac{1}{n\lambda}K\left(\frac{x-{x}_{k}}{\lambda}\right){\left(\frac{x-{x}_{k}}{\lambda}\right)}^{r}\mathrm{,}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}r\mathrm{=0,1,2,}$$

and

$${\tilde{M}}_{r\mathrm{,}s}\left(x\right)\mathrm{:=}{\displaystyle \sum _{k\in s}\frac{1}{{\pi}_{k}n\lambda}K}\left(\frac{x-{x}_{k}}{\lambda}\right){\left(\frac{x-{x}_{k}}{\lambda}\right)}^{r}\mathrm{,}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}r\mathrm{=0,1,2.}$$
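The local linear weights and their design-weighted counterparts can be computed with one routine, as in the sketch below. The function signature, the Gaussian kernel and the convention that omitting `pi_sample` yields the unweighted ${w}_{i,j}$ are assumptions for illustration only.

```python
import numpy as np

def gaussian_kernel(u):
    # Assumed kernel; its unbounded support keeps the weights well-defined.
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def local_linear_weights(x_target, x_sample, bandwidth, kernel, pi_sample=None):
    # Returns w[i, j] (pi_sample is None) or the design-weighted w_tilde[i, j].
    n = x_sample.shape[0]
    if pi_sample is None:
        pi_sample = np.ones(n)
    u = (x_target[:, None] - x_sample[None, :]) / bandwidth   # (x_i - x_j) / lambda
    base = kernel(u) / (pi_sample[None, :] * n * bandwidth)
    m0 = base.sum(axis=1)               # M_{0,s}(x_i)
    m1 = (base * u).sum(axis=1)         # M_{1,s}(x_i)
    m2 = (base * u**2).sum(axis=1)      # M_{2,s}(x_i)
    denom = m2 * m0 - m1**2
    return base * (m2[:, None] - u * m1[:, None]) / denom[:, None]
```

Two properties are easy to verify numerically with this routine: each row of weights sums to one, and applying the weights to a linear function of the sampled $x$-values reproduces that function at the target points.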

It is worth noting that the nonparametric estimators of this section are well-defined only if the regression weights ${w}_{i,j}$ and ${\tilde{w}}_{i,j}$ in their definitions are well-defined. The weights fail to be well-defined, for example, when the support of the kernel function $K\left(u\right)$ is the interval $\left[-1,1\right]$ (e.g., uniform kernel, Epanechnikov kernel) and there are fewer than two $j\in s$ such that $\left|{x}_{i}-{x}_{j}\right|<\lambda .$ To overcome this problem one can use a kernel function whose support is the whole real line (e.g., Gaussian kernel) or choose the bandwidth adaptively. The latter solution may also lead to more efficient estimators (see e.g., Fan and Gijbels 1992). With reference to the estimators ${\widehat{F}}^{*}\left(t\right)$ and ${\tilde{F}}^{*}\left(t\right)$ based on the modified fitted values, it is moreover worth noting that one could in principle apply different bandwidths and/or regression weights to the ${y}_{i}$-values and to the indicator functions. For the sake of simplicity, in the present work we shall consider neither adaptive bandwidth selection nor the possibility of using different regression weights to estimate the mean regression function and the distributions of the error components.
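For a kernel supported on $\left[-1,1\right]$, the counting condition described above can be checked before any weights are computed. The helper below is a hypothetical sketch that flags the target points with at least two sampled $x$-values strictly within one bandwidth.

```python
import numpy as np

def well_defined_mask(x_target, x_sample, bandwidth):
    # close[i, j] is True when |x_i - x_j| < lambda
    close = np.abs(x_target[:, None] - x_sample[None, :]) < bandwidth
    # At least two sampled neighbours are required at each target point.
    return close.sum(axis=1) >= 2

# Illustrative values only: the first target has two neighbours within 0.5,
# the second has one, the third has none.
mask = well_defined_mask(
    np.array([0.0, 1.0, 5.0]), np.array([0.1, 0.2, 1.05, 4.0]), 0.5
)
```

Target points where the mask is false are the ones for which a wider or adaptive bandwidth, or a kernel with unbounded support, would be needed.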

Comparing the definitions of the estimators based on the two types of fitted values, it is immediately clear that $\widehat{F}\left(t\right)$ and $\tilde{F}\left(t\right)$ are easier to compute, since they are linear combinations of the observed indicator functions $I\left({y}_{j}\le t\right).$ The coefficients of these linear combinations do not depend on the study variable $Y$ and can therefore be used to estimate averages of functions other than indicator functions, or of functions of several study variables, in particular when there are reasons to believe that the latter are related to the auxiliary variable $X.$ This fact is of particular value to practitioners who want estimates related to several study variables to be consistent with one another. However, there is also a strong argument in favor of the estimators ${\widehat{F}}^{*}\left(t\right)$ and ${\tilde{F}}^{*}\left(t\right)$ based on the modified fitted values: if ${y}_{i}=a+b{x}_{i}$ for all $i\in U,$ then it follows that ${\widehat{F}}^{*}\left(t\right)={\tilde{F}}^{*}\left(t\right)={F}_{N}\left(t\right)$ for every sample $s$ such that the estimators are well-defined. One would therefore expect ${\widehat{F}}^{*}\left(t\right)$ and ${\tilde{F}}^{*}\left(t\right)$ to be more efficient than $\widehat{F}\left(t\right)$ and $\tilde{F}\left(t\right)$ when there is a strong regression relationship between $Y$ and $X.$
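The exactness property for a perfectly linear population can be verified in a few steps. The derivation below is a sketch that spells out intermediate steps not given in the source; it uses only the definitions of the local linear weights and of ${M}_{r,s},$ with all ${M}_{r,s}$ evaluated at ${x}_{i}.$ Summing the weights over $j\in s$ gives

$$\sum_{j\in s}{w}_{i,j}=\frac{{M}_{2,s}{M}_{0,s}-{M}_{1,s}^{2}}{{M}_{2,s}{M}_{0,s}-{M}_{1,s}^{2}}=1,\qquad \sum_{j\in s}{w}_{i,j}\,\frac{{x}_{i}-{x}_{j}}{\lambda}=\frac{{M}_{2,s}{M}_{1,s}-{M}_{1,s}{M}_{2,s}}{{M}_{2,s}{M}_{0,s}-{M}_{1,s}^{2}}=0,$$

so that $\sum_{j\in s}{w}_{i,j}{x}_{j}={x}_{i}.$ If ${y}_{k}=a+b{x}_{k}$ for all $k\in U,$ it follows that

$${\widehat{m}}_{i}=\sum_{k\in s}{w}_{i,k}\left(a+b{x}_{k}\right)=a+b{x}_{i}={y}_{i},\qquad i\in U,$$

and therefore

$${\widehat{G}}_{i}^{*}\left(t\right)=\sum_{j\in s}{w}_{i,j}\,I\left(0\le t-{y}_{i}\right)=I\left({y}_{i}\le t\right).$$

Substituting into (2.5) yields ${\widehat{F}}^{*}\left(t\right)={F}_{N}\left(t\right).$ The same argument applied to ${\tilde{w}}_{i,j}$ and ${\tilde{M}}_{r,s}$ gives ${\tilde{G}}_{i}^{*}\left(t\right)=I\left({y}_{i}\le t\right),$ so every design-weighted residual in ${\tilde{F}}^{*}\left(t\right)$ vanishes and the remaining term equals ${F}_{N}\left(t\right).$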
