# A short note on quantile and expectile estimation in unequal probability samples 2. Quantile estimation

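We consider a finite population with $N$ elements and a continuous survey variable $Y.$ We are interested in quantiles of the cumulative distribution function $F\left(y\right)={\sum }_{i=1}^{N}1\left\{{Y}_{i}\le y\right\}/N$ and define

$Q\left(\alpha \right)=\mathrm{inf}\left\{\mathrm{arg}\underset{q}{\mathrm{min}}\sum _{i=1}^{N}\text{\hspace{0.17em}}{w}_{\alpha }\left({Y}_{i}-q\right)|\text{\hspace{0.17em}}{Y}_{i}-q\text{\hspace{0.17em}}|\right\}\text{ }\text{ }\text{ }\text{ }\text{ }\left(2.1\right)$

as the quantile function of $Y$ (see Koenker 2005), where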

${w}_{\alpha }\left(\epsilon \right)=\left\{\begin{array}{ll}\alpha & \text{for}\text{\hspace{0.17em}}\epsilon >0\\ 1-\alpha & \text{for}\text{\hspace{0.17em}}\epsilon \le 0.\end{array}\right.$

The “inf” argument in (2.1) is required in finite populations since the “arg min” need not be unique. We draw a sample from the population with known inclusion probabilities ${\pi }_{i},$ $i=1,\dots ,N.$ Denoting by ${y}_{1},\dots ,{y}_{n}$ the resulting sample, we estimate the quantile function by replacing (2.1) with its weighted sample version

${\stackrel{^}{Q}}_{N}\left(\alpha \right)=\mathrm{inf}\left\{\mathrm{arg}\underset{q}{\mathrm{min}}\sum _{j=1}^{n}\text{\hspace{0.17em}}\frac{1}{{\pi }_{j}}{w}_{\alpha ,j}|\text{\hspace{0.17em}}{y}_{j}-q\text{\hspace{0.17em}}|\right\}\text{ }\text{ }\text{ }\text{ }\text{ }\left(2.2\right)$

with ${w}_{\alpha ,j}={w}_{\alpha }\left({y}_{j}-q\right)$ as defined above. It is easy to see that the sum in (2.2) is a design-unbiased estimate for the sum in $Q\left(\alpha \right)$ given in (2.1). Nonetheless, because we take the “arg min” it follows that ${\stackrel{^}{Q}}_{N}\left(\alpha \right)$ is not unbiased for $Q\left(\alpha \right).$ We therefore look at consistency statements for ${\stackrel{^}{Q}}_{N}\left(\alpha \right)$ as follows. Let ${R}_{i}\left(q\right)={w}_{\alpha }\left({y}_{i}-q\right)|\text{\hspace{0.17em}}{y}_{i}-q\text{\hspace{0.17em}}|$ and

${\overline{R}}_{N}\left(q\right):=\frac{1}{N}\sum _{i}\text{\hspace{0.17em}}{R}_{i}\left(q\right).$
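As a minimal numerical sketch of the weighted minimization in (2.2), the following Python code evaluates the weighted check loss at every distinct sample value and returns the smallest minimizer. The function names `check_loss` and `weighted_quantile_argmin`, the use of NumPy and the tie tolerance are illustrative assumptions and not part of the original derivation. Restricting the search to the sample values relies on the objective being convex and piecewise linear in $q$ with kinks only at the ${y}_{j},$ so that the infimum of the set of minimizers is itself a sample value.

```python
import numpy as np

def check_loss(eps, alpha):
    """Return w_alpha(eps) * |eps|, with w = alpha for eps > 0 and 1 - alpha otherwise."""
    eps = np.asarray(eps, dtype=float)
    return np.where(eps > 0, alpha, 1.0 - alpha) * np.abs(eps)

def weighted_quantile_argmin(y, pi, alpha):
    """Smallest minimizer of sum_j (1/pi_j) * w_alpha(y_j - q) * |y_j - q| over q."""
    y = np.asarray(y, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)  # design weights 1/pi_j
    candidates = np.unique(y)              # sorted distinct sample values
    losses = np.array([np.sum(d * check_loss(y - q, alpha)) for q in candidates])
    # Assumed tolerance for floating-point ties; the first tied candidate is the "inf".
    tied = np.flatnonzero(np.isclose(losses, losses.min(), rtol=1e-12, atol=1e-12))
    return candidates[tied[0]]
```

For the sample values (1, 2, 3) with inclusion probabilities (0.2, 0.5, 0.5), `weighted_quantile_argmin([1, 2, 3], [0.2, 0.5, 0.5], 0.5)` returns 1.0, because the first observation carries more than half of the total weight.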

We draw a sample from ${R}_{i}\left(q\right),i=1,\dots ,N$ and assume we apply a consistent sampling scheme in that

${\overline{r}}_{n}\left(q\right):=\frac{1}{N}\sum _{j=1}^{n}\text{\hspace{0.17em}}\frac{1}{{\pi }_{j}}{r}_{j}\left(q\right)$

is design-consistent for ${\overline{R}}_{N}\left(q\right),$ where ${r}_{j}\left(q\right)$ denotes the sampled values of ${R}_{i}\left(q\right).$ Note that ${r}_{j}\left(q\right),$ and hence ${\overline{r}}_{n}\left(q\right),$ ${R}_{i}\left(q\right)$ and ${\overline{R}}_{N}\left(q\right),$ also depend on $\alpha ,$ which is suppressed in the notation for readability. Let ${q}_{0}$ be the minimizer of ${\overline{R}}_{N}\left(q\right),$ which is not necessarily unique due to the finite structure of the population. We could take the “inf” argument, i.e., ${q}_{0}=\mathrm{inf}\left\{\text{arg}\text{\hspace{0.17em}}\mathrm{min}{\overline{R}}_{N}\left(q\right)\right\},$ but for simplicity we assume a superpopulation model (see Isaki and Fuller 1982) by considering the finite population to be a sample from an infinite superpopulation. In the latter we assume that the survey variable $Y$ has a continuous cumulative distribution function so that ${q}_{0}$ is the unique $\alpha$-quantile. We get for $\delta >0$

$P\left({\overline{r}}_{n}\left({q}_{0}\right)<{\overline{r}}_{n}\left({q}_{0}-\delta \right)\right)=P\left(\frac{1}{N}\sum _{j=1}^{n}\text{\hspace{0.17em}}\frac{1}{{\pi }_{j}}\left\{{r}_{j}\left({q}_{0}\right)-{r}_{j}\left({q}_{0}-\delta \right)\right\}<0\right).$

Note that the argument in the probability statement is a design-consistent estimate for ${\overline{R}}_{N}\left({q}_{0}\right)-{\overline{R}}_{N}\left({q}_{0}-\delta \right),$ which is less than zero since ${q}_{0}$ is the minimizer of ${\overline{R}}_{N}\left(\cdot \right).$ Hence, the probability tends to one in the sense of design consistency defined in Isaki and Fuller (1982). The same holds for $\delta <0.$ We may therefore conclude that the estimated minimizer ${\stackrel{^}{q}}_{0}=\text{arg}\text{\hspace{0.17em}}\mathrm{min}{\sum }_{j=1}^{n}1/{\pi }_{j}\text{\hspace{0.17em}}{r}_{j}\left(q\right)$ is a design-consistent estimate for ${q}_{0},$ so that ${\stackrel{^}{Q}}_{N}\left(\alpha \right)$ in (2.2) is in turn design-consistent for $Q\left(\alpha \right).$ It is easily shown that ${\stackrel{^}{Q}}_{N}\left(\alpha \right)$ is the inverse of the normalized weighted cumulative distribution function

${\stackrel{^}{F}}_{N}\left(y\right):=\frac{\sum _{j=1}^{n}\text{\hspace{0.17em}}1\left\{{y}_{j}\le y\right\}/{\pi }_{j}}{\sum _{j=1}^{n}\text{\hspace{0.17em}}1/{\pi }_{j}}$

using the same notation as in Kuk (1988). Note that ${\stackrel{^}{F}}_{N}\left(y\right)$ is the Hájek (1971) estimate of the cumulative distribution function (see also Rao and Wu 2009) and as such not a Horvitz-Thompson estimate. As a consequence, ${\stackrel{^}{Q}}_{N}\left(\alpha \right)$ is not design-unbiased. Nonetheless, ${\stackrel{^}{F}}_{N}\left(y\right)$ is a valid distribution function, and hence it can be considered a normalized version of the Lahiri or Horvitz-Thompson estimator of the distribution function (see Lahiri 1951), which is denoted by

${\stackrel{^}{F}}_{L}\left(y\right):=\frac{1}{N}\sum _{j=1}^{n}\text{\hspace{0.17em}}\frac{1}{{\pi }_{j}}1\left\{{y}_{j}\le y\right\}.$
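The statement above that ${\stackrel{^}{Q}}_{N}\left(\alpha \right)$ is the inverse of ${\stackrel{^}{F}}_{N}$ can be illustrated with a short Python sketch. The function names `hajek_cdf`, `lahiri_cdf` and `hajek_quantile` are hypothetical, and the inverse is taken here to be the generalized inverse, i.e., the smallest sample value $y$ with ${\stackrel{^}{F}}_{N}\left(y\right)\ge \alpha ,$ which is an assumption about how ties and flat segments are resolved. The population size `N` passed to `lahiri_cdf` is assumed to be known.

```python
import numpy as np

def hajek_cdf(y_sample, pi, y):
    """Normalized weighted CDF: weighted share of sample values <= y."""
    y_sample = np.asarray(y_sample, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)  # design weights 1/pi_j
    return np.sum(d * (y_sample <= y)) / np.sum(d)

def lahiri_cdf(y_sample, pi, y, N):
    """Same numerator as hajek_cdf, divided by the known population size N."""
    y_sample = np.asarray(y_sample, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)
    return np.sum(d * (y_sample <= y)) / N

def hajek_quantile(y_sample, pi, alpha):
    """Generalized inverse: smallest sample value whose Hajek CDF is >= alpha."""
    y_sample = np.asarray(y_sample, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)
    order = np.argsort(y_sample)
    cum = np.cumsum(d[order]) / np.sum(d)   # Hajek CDF along the sorted sample
    idx = np.searchsorted(cum, alpha, side="left")
    idx = min(idx, len(cum) - 1)            # guard against rounding just below 1
    return y_sample[order][idx]
```

With sample values (1, 2, 3), inclusion probabilities (0.2, 0.5, 0.5) and an assumed population size of 10, the normalized estimate reaches one at the largest observation while the unnormalized estimate only reaches 0.9, which illustrates why only the former is a valid distribution function in general.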

Kuk (1988) proposes replacing ${\stackrel{^}{F}}_{L}\left(\cdot \right)$ with alternative estimates of the distribution function: instead of estimating the distribution function itself, he suggests estimating the complementary proportion ${\stackrel{^}{S}}_{R}\left(y\right),$ which then leads to the estimate ${\stackrel{^}{F}}_{R}\left(y\right)$ defined through

${\stackrel{^}{F}}_{R}\left(y\right)=1-{\stackrel{^}{S}}_{R}\left(y\right)=1-\frac{1}{N}\sum _{j=1}^{n}\text{\hspace{0.17em}}\frac{1}{{\pi }_{j}}1\left\{{y}_{j}>y\right\}.$

It follows directly from these definitions that ${\stackrel{^}{F}}_{R}\left(\cdot \right)$ can be expressed in terms of ${\stackrel{^}{F}}_{N}\left(\cdot \right)$ through

${\stackrel{^}{F}}_{R}=1-\frac{1}{N}\sum _{j=1}^{n}1/{\pi }_{j}+{\stackrel{^}{F}}_{L}\text{ }\text{and}\text{ }{\stackrel{^}{F}}_{L}=\frac{\sum _{j=1}^{n}1/{\pi }_{j}}{N}{\stackrel{^}{F}}_{N}.\text{ }\text{ }\text{ }\text{ }\text{ }\left(2.3\right)$

Kuk (1988) shows that, under sampling with unequal probabilities, estimation of the median derived from ${\stackrel{^}{F}}_{R}$ outperforms median estimates derived from ${\stackrel{^}{F}}_{N}$ and ${\stackrel{^}{F}}_{L}$ in terms of mean squared estimation error. Note that the estimators ${\stackrel{^}{F}}_{N},$ ${\stackrel{^}{F}}_{L}$ and ${\stackrel{^}{F}}_{R}$ coincide in the case of simple random sampling without replacement where ${\pi }_{j}=\pi =n/N.$
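The relations in (2.3) and the coincidence of the three estimators under simple random sampling can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function name `cdf_estimates` is hypothetical, the population size `N` is assumed known, and the toy numbers in the usage note are made up for the example rather than taken from Kuk (1988).

```python
import numpy as np

def cdf_estimates(y_sample, pi, y, N):
    """Return (F_N, F_L, F_R) evaluated at y for a known population size N."""
    y_sample = np.asarray(y_sample, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)     # design weights 1/pi_j
    below = np.sum(d * (y_sample <= y))       # weighted count of values <= y
    above = np.sum(d * (y_sample > y))        # weighted count of values > y
    f_n = below / np.sum(d)                   # normalized (Hajek) estimate
    f_l = below / N                           # Lahiri / Horvitz-Thompson estimate
    f_r = 1.0 - above / N                     # estimate via the complementary proportion
    return f_n, f_l, f_r

def check_relation_23(y_sample, pi, y, N, tol=1e-12):
    """Check both identities in (2.3) at the point y, up to rounding error."""
    d_sum = np.sum(1.0 / np.asarray(pi, dtype=float))
    f_n, f_l, f_r = cdf_estimates(y_sample, pi, y, N)
    first = abs(f_r - (1.0 - d_sum / N + f_l)) < tol
    second = abs(f_l - d_sum / N * f_n) < tol
    return bool(first and second)
```

For sample values (1, 2, 3) with inclusion probabilities (0.2, 0.5, 0.5) and $N=10,$ the three estimates at $y=2$ are 7/9, 0.7 and 0.8, and both identities in (2.3) hold. With four values drawn by simple random sampling from $N=8$ elements, so that every ${\pi }_{j}=0.5,$ the weights sum to $N$ and all three estimates agree.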
