# 3 Generalized regression estimation

Jan de Haan and Rens Hendriks

## 3.1 A simple GREG method

In this section we will outline an alternative approach to measuring house price change that makes use of appraisal data. The appraisals now serve as auxiliary information in a generalized regression (GREG) framework. Consider the following simple two-variable linear regression model:

$${p}_{n}^{0}={\alpha}^{0}+{\beta}^{0}{a}_{n}^{0}+{\epsilon}_{n}^{0},\text{}\text{}\left(3.1\right)$$

where ${\epsilon}_{n}^{0}$ is the error term. Unlike hedonic regression models, which postulate a causal relation between the selling price ${p}_{n}^{0}$ and a set of characteristics relating to the structure and the location of the housing units, this model does not say anything about how house prices are generated; equation (3.1) is merely a descriptive model.

Estimating model (3.1) by least squares regression on the data of sample ${S}^{0}$ yields predicted prices

$${\widehat{p}}_{n}^{0}={\widehat{\alpha}}^{0}+{\widehat{\beta}}^{0}{a}_{n}^{0}.\text{}\text{}\left(3.2\right)$$

The regression residuals for $n\in {S}^{0}$ are ${e}_{n}^{0}={p}_{n}^{0}-{\widehat{p}}_{n}^{0}.$ Assuming random sampling, as before, we can write the Horvitz-Thompson estimator ${\sum}_{n\in {S}^{0}}{p}_{n}^{0}/{n}^{0}$ of the mean value ${\sum}_{n\in {U}^{0}}{p}_{n}^{0}/{N}^{0}$ as

$${\displaystyle \sum _{n\in {S}^{0}}{p}_{n}^{0}/{n}^{0}}={\displaystyle \sum _{n\in {S}^{0}}{\widehat{p}}_{n}^{0}/{n}^{0}}+{\displaystyle \sum _{n\in {S}^{0}}{e}_{n}^{0}/{n}^{0}}={\widehat{\alpha}}^{0}+{\widehat{\beta}}^{0}{\displaystyle \sum _{n\in {S}^{0}}{a}_{n}^{0}/{n}^{0}}+{\displaystyle \sum _{n\in {S}^{0}}{e}_{n}^{0}/{n}^{0}}.\text{}\text{}\text{}\left(3.3\right)$$

Replacing the sample average of appraisals, ${\sum}_{n\in {S}^{0}}{a}_{n}^{0}/{n}^{0},$ by its population counterpart ${\sum}_{n\in {U}^{0}}{a}_{n}^{0}/{N}^{0}$ yields the generalized regression (GREG) estimator:

$${\widehat{\overline{p}}}_{\text{GREG}}^{0}={\widehat{\alpha}}^{0}+{\widehat{\beta}}^{0}{\displaystyle \sum _{n\in {U}^{0}}{a}_{n}^{0}}/{N}^{0}+{\displaystyle \sum _{n\in {S}^{0}}{e}_{n}^{0}/{n}^{0}}={\displaystyle \sum _{n\in {U}^{0}}{\widehat{p}}_{n}^{0}/{N}^{0}}+{\displaystyle \sum _{n\in {S}^{0}}{e}_{n}^{0}/{n}^{0}}.\text{}\text{}\text{}\left(3.4\right)$$

Model-assisted sampling theory shows that GREG estimators are *asymptotically design unbiased* (Särndal *et al.*, 1992), irrespective of the choice of regressors. Unless the sample is small, the bias can be neglected. The GREG estimator (3.4) will be more efficient (in the sense that it has a lower variance) than the Horvitz-Thompson estimator (3.3) when appraisals explain much of the variation in selling prices. As a result, the GREG estimator will usually outperform the Horvitz-Thompson estimator in terms of the mean square error (the sum of the variance and the squared bias).
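The steps in equations (3.2) to (3.4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not production code: the function names `ols_fit` and `greg_mean` and all numbers are hypothetical, and the sample is treated as a simple random sample from the stock.

```python
from statistics import mean

def ols_fit(appraisals, prices):
    """Least squares fit of price = alpha + beta * appraisal, as in (3.1)."""
    a_bar, p_bar = mean(appraisals), mean(prices)
    sxx = sum((a - a_bar) ** 2 for a in appraisals)
    sxy = sum((a - a_bar) * (p - p_bar) for a, p in zip(appraisals, prices))
    beta = sxy / sxx
    alpha = p_bar - beta * a_bar
    return alpha, beta

def greg_mean(sample_appraisals, sample_prices, population_appraisals):
    """GREG estimate of the mean price of the stock, equation (3.4)."""
    alpha, beta = ols_fit(sample_appraisals, sample_prices)
    residuals = [p - (alpha + beta * a)
                 for a, p in zip(sample_appraisals, sample_prices)]
    # Predicted mean at the population mean appraisal plus the mean sample residual.
    return alpha + beta * mean(population_appraisals) + mean(residuals)

# Hypothetical toy data, made up for illustration only.
population_appraisals = [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
sample_appraisals = [100.0, 200.0, 300.0]   # appraisals of the houses sold
sample_prices = [110.0, 205.0, 330.0]       # their selling prices

ht_mean = mean(sample_prices)               # sample mean, as in (3.3)
greg = greg_mean(sample_appraisals, sample_prices, population_appraisals)
print(ht_mean, greg)
```

In this toy example the sold houses have a lower mean appraisal than the stock as a whole, so the GREG estimate lies above the plain sample mean.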

The same procedure can be applied to the comparison period $t.$ After estimating the model

$${p}_{n}^{t}={\alpha}^{t}+{\beta}^{t}{a}_{n}^{0}+{\epsilon}_{n}^{t}\text{}\text{}\text{}\left(3.5\right)$$

through least squares regression on the data of the current period sample ${S}^{t},$ we obtain predicted prices

$${\widehat{p}}_{n}^{t}={\widehat{\alpha}}^{t}+{\widehat{\beta}}^{t}{a}_{n}^{0},\text{}\text{}\text{}\left(3.6\right)$$

which lead to the GREG estimator of the mean value of the housing stock in period $t:$

$${\widehat{\overline{p}}}_{\text{GREG}}^{t}={\widehat{\alpha}}^{t}+{\widehat{\beta}}^{t}{\displaystyle \sum _{n\in {U}^{t}}{a}_{n}^{0}}/{N}^{t}+{\displaystyle \sum _{n\in {S}^{t}}{e}_{n}^{t}/{n}^{t}}={\displaystyle \sum _{n\in {U}^{t}}{\widehat{p}}_{n}^{t}}/{N}^{t}+{\displaystyle \sum _{n\in {S}^{t}}{e}_{n}^{t}/{n}^{t}},\text{}\text{}\text{}\left(3.7\right)$$

where ${e}_{n}^{t}={p}_{n}^{t}-{\widehat{p}}_{n}^{t}$ denotes the period $t$ regression residuals. For a fixed housing stock we have ${U}^{t}={U}^{0},$ hence ${\sum}_{n\in {U}^{t}}{a}_{n}^{0}/{N}^{t}={\sum}_{n\in {U}^{0}}{a}_{n}^{0}/{N}^{0},$ and it follows that

$${\widehat{\overline{p}}}_{\text{GREG}}^{t}={\widehat{\alpha}}^{t}+{\widehat{\beta}}^{t}{\displaystyle \sum _{n\in {U}^{0}}{a}_{n}^{0}}/{N}^{0}+{\displaystyle \sum _{n\in {S}^{t}}{e}_{n}^{t}/{n}^{t}}={\displaystyle \sum _{n\in {U}^{0}}{\widehat{p}}_{n}^{t}}/{N}^{0}+{\displaystyle \sum _{n\in {S}^{t}}{e}_{n}^{t}/{n}^{t}}.\text{}\text{}\left(3.8\right)$$

The GREG estimator of house price change results simply from taking the ratio of equations (3.8) and (3.4):

$${\widehat{P}}_{\text{GREG}}^{0t}=\frac{{\widehat{\overline{p}}}_{\text{GREG}}^{t}}{{\widehat{\overline{p}}}_{\text{GREG}}^{0}}=\frac{{\widehat{\alpha}}^{t}+{\widehat{\beta}}^{t}{\overline{a}}^{0}+{\displaystyle \sum _{n\in {S}^{t}}{e}_{n}^{t}/{n}^{t}}}{{\widehat{\alpha}}^{0}+{\widehat{\beta}}^{0}{\overline{a}}^{0}+{\displaystyle \sum _{n\in {S}^{0}}{e}_{n}^{0}/{n}^{0}}}=\frac{{\displaystyle \sum _{n\in {U}^{0}}{\widehat{p}}_{n}^{t}/{N}^{0}}+{\displaystyle \sum _{n\in {S}^{t}}{e}_{n}^{t}/{n}^{t}}}{{\displaystyle \sum _{n\in {U}^{0}}{\widehat{p}}_{n}^{0}/{N}^{0}}+{\displaystyle \sum _{n\in {S}^{0}}{e}_{n}^{0}/{n}^{0}}},\text{}\text{}\text{}\left(3.9\right)$$

where ${\overline{a}}^{0}={\sum}_{n\in {U}^{0}}{a}_{n}^{0}/{N}^{0}.$ Some additional small sample bias will be introduced due to the non-linear (ratio) structure. When using Ordinary Least Squares (OLS) regression to estimate the models (3.1) and (3.5), both of which include an intercept term, the unweighted sample means of regression residuals in (3.9), ${\sum}_{n\in {S}^{0}}{e}_{n}^{0}/{n}^{0}$ and ${\sum}_{n\in {S}^{t}}{e}_{n}^{t}/{n}^{t},$ will be equal to 0 and the GREG index reduces to

$${\widehat{P}}_{\text{GREG,OLS}}^{0t}=\frac{{\displaystyle \sum _{n\in {U}^{0}}{\widehat{p}}_{n}^{t}}/{N}^{0}}{{\displaystyle \sum _{n\in {U}^{0}}{\widehat{p}}_{n}^{0}}/{N}^{0}}=\frac{{\widehat{\alpha}}^{t}+{\widehat{\beta}}^{t}{\overline{a}}^{0}}{{\widehat{\alpha}}^{0}+{\widehat{\beta}}^{0}{\overline{a}}^{0}}=\frac{{\widehat{\alpha}}^{t}/{\overline{a}}^{0}+{\widehat{\beta}}^{t}}{{\widehat{\alpha}}^{0}/{\overline{a}}^{0}+{\widehat{\beta}}^{0}}.\text{}\text{}\text{}\left(3.10\right)$$

As the first expression on the right-hand side of (3.10) indicates, the (OLS) GREG approach essentially imputes prices pertaining to the base period and the current period using equations (3.2) and (3.6). The difference from the hedonic *double imputation* method is twofold: a descriptive model, not a hedonic one, is used to estimate predicted prices (so that we cannot speak of unbiased predicted prices), and prices are imputed for all houses of the housing stock instead of the sub-set of sampled houses.
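The equivalence of the first two expressions in (3.10) can be checked numerically. The sketch below is illustrative only; the function names and toy data are hypothetical. It computes the OLS GREG index once from the regression coefficients and the population mean appraisal, and once by imputing a price for every house in the stock.

```python
from statistics import mean

def ols_fit(appraisals, prices):
    """Least squares fit of price = alpha + beta * appraisal."""
    a_bar, p_bar = mean(appraisals), mean(prices)
    sxx = sum((a - a_bar) ** 2 for a in appraisals)
    sxy = sum((a - a_bar) * (p - p_bar) for a, p in zip(appraisals, prices))
    beta = sxy / sxx
    return p_bar - beta * a_bar, beta

def greg_ols_index(base, current, population_appraisals):
    """OLS GREG index from coefficients and the mean appraisal, as in (3.10)."""
    a_bar = mean(population_appraisals)
    alpha0, beta0 = ols_fit(*base)
    alpha_t, beta_t = ols_fit(*current)
    return (alpha_t + beta_t * a_bar) / (alpha0 + beta0 * a_bar)

def imputation_index(base, current, population_appraisals):
    """Same index, computed by imputing prices for all houses in the stock."""
    alpha0, beta0 = ols_fit(*base)
    alpha_t, beta_t = ols_fit(*current)
    imputed_0 = [alpha0 + beta0 * a for a in population_appraisals]
    imputed_t = [alpha_t + beta_t * a for a in population_appraisals]
    return mean(imputed_t) / mean(imputed_0)

# Hypothetical toy data: (base period appraisals, selling prices) per sample.
population_appraisals = [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
base = ([100.0, 200.0, 300.0], [110.0, 205.0, 330.0])      # sample S^0
current = ([150.0, 250.0, 350.0], [180.0, 290.0, 400.0])   # sample S^t

index = greg_ols_index(base, current, population_appraisals)
index_imputed = imputation_index(base, current, population_appraisals)
print(index, index_imputed)
```

Note that the regressor in both periods is the base period appraisal, in line with models (3.1) and (3.5).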

## 3.2 Properties of the GREG index

The (OLS) GREG index has several properties worth mentioning. First, the computation of the GREG index is very simple. Once the population mean of appraisals ${\overline{a}}^{0}$ and the base period regression coefficients ${\widehat{\alpha}}^{0}$ and ${\widehat{\beta}}^{0}$ have been calculated, all that is needed is running a regression each month of selling prices against appraisals and plugging the coefficients ${\widehat{\alpha}}^{t}$ and ${\widehat{\beta}}^{t}$ into (3.10). Note that the GREG index can be written as a pseudo chain index:

$${\widehat{P}}_{\text{GREG,OLS}}^{0t}=\frac{{\widehat{\alpha}}^{t}/{\overline{a}}^{0}+{\widehat{\beta}}^{t}}{{\widehat{\alpha}}^{0}/{\overline{a}}^{0}+{\widehat{\beta}}^{0}}={\displaystyle \prod _{\tau =1}^{t}\frac{{\widehat{\alpha}}^{\tau}/{\overline{a}}^{0}+{\widehat{\beta}}^{\tau}}{{\widehat{\alpha}}^{\tau -1}/{\overline{a}}^{0}+{\widehat{\beta}}^{\tau -1}}}.\text{}\text{}\text{}\left(3.11\right)$$

This
can be helpful in practice, particularly when new appraisal data becomes
available. New appraisal data often becomes available to the statistical
agency with a considerable time lag, sometimes of more than a year. There are two
reasons for using the latest appraisal information. The quality of the
appraisals may improve over time, which seems to have been the case in the
Netherlands (de Vries *et al*.
2009). Also, the assumption of a fixed housing stock can be relaxed so that
newly-built properties can be incorporated through chaining; the resulting
chained GREG index takes the dynamics of the housing stock into account. The
same advantages of chaining apply to the SPAR method. Suppose new appraisals, relating to period $T(0<T\le t),$ are available in period $t+1.$ The time series can then be
updated through chain-linking, *i.e.*, by multiplying ${\widehat{P}}_{\text{GREG,OLS}}^{0t}$ by the month-to-month change $({\tilde{\alpha}}^{t+1}/{\overline{a}}^{T}+{\tilde{\beta}}^{t+1})/({\tilde{\alpha}}^{t}/{\overline{a}}^{T}+{\tilde{\beta}}^{t}),$ where the coefficients now
pertain to a regression of selling prices on the period $T$ appraisals.
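The pseudo chain form (3.11) and the chain-linking update can be illustrated as follows. This is a sketch with hypothetical coefficient values; the helper names `level`, `direct_index`, `chained_index` and `link_with_new_appraisals` are our own and do not refer to any existing software.

```python
def level(alpha, beta, a_bar):
    """The term alpha / a_bar + beta that appears in (3.10) and (3.11)."""
    return alpha / a_bar + beta

def direct_index(coefs, a_bar, t):
    """Direct OLS GREG index between periods 0 and t, as in (3.10)."""
    return level(*coefs[t], a_bar) / level(*coefs[0], a_bar)

def chained_index(coefs, a_bar, t):
    """Product of period-to-period links, right-hand side of (3.11)."""
    index = 1.0
    for tau in range(1, t + 1):
        index *= level(*coefs[tau], a_bar) / level(*coefs[tau - 1], a_bar)
    return index

def link_with_new_appraisals(index_0t, coefs_t, coefs_t1, a_bar_T):
    """Extend the series to t+1 using regressions on the period T appraisals."""
    return index_0t * level(*coefs_t1, a_bar_T) / level(*coefs_t, a_bar_T)

# Hypothetical (alpha, beta) estimates for periods 0, 1, 2 and mean appraisal.
coefs = [(-5.0, 1.1), (2.0, 1.1), (15.0, 1.1)]
a_bar_0 = 225.0

direct = direct_index(coefs, a_bar_0, 2)
chained = chained_index(coefs, a_bar_0, 2)

# Hypothetical coefficients from regressions on the new (period T) appraisals.
updated = link_with_new_appraisals(direct, (10.0, 1.0), (12.0, 1.01), 240.0)
print(direct, chained, updated)
```

Because the same mean appraisal enters every link, the intermediate terms cancel and the chained and direct indexes coincide; only the link based on the new appraisals changes the series.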

Second, *standard
errors* of the GREG index can be estimated rather easily using the variance-covariance
matrix of the regression coefficients, which is standard output of most
statistical packages. An expression for the approximate standard error is
derived in the Appendix. The standard error of the GREG index depends on the
goodness of fit $({R}^{2})$ of the regression model. It is most likely
that ${R}^{2}$ for the base period regression is higher than
that for the current period regressions. This is because we expect to find a
strong linear relation between appraisals and sale prices in the appraisal
reference period while in later periods this relation will probably be weaker
due to differing price trends across different types of houses or regions. The derivation of approximate
standard errors for the SPAR index is a bit more complex because there is an
additional source of sampling error, namely the sampling variability of the
mean appraisals; see de Haan (2007).
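As a rough illustration of how such a standard error could be obtained, the sketch below applies a first-order Taylor (delta method) approximation to the ratio in (3.10). It is not the expression derived in the Appendix. It rests on assumptions of our own: homoskedastic regression errors, independence of the base and current period samples, and a population mean appraisal known without error. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean

def level_and_variance(appraisals, prices, a_bar):
    """Return alpha_hat + beta_hat * a_bar and its estimated OLS variance."""
    n = len(appraisals)
    a_s, p_s = mean(appraisals), mean(prices)
    sxx = sum((a - a_s) ** 2 for a in appraisals)
    beta = sum((a - a_s) * (p - p_s) for a, p in zip(appraisals, prices)) / sxx
    alpha = p_s - beta * a_s
    ssr = sum((p - alpha - beta * a) ** 2 for a, p in zip(appraisals, prices))
    sigma2 = ssr / (n - 2)                      # residual variance estimate
    # Equals c'Vc with c = (1, a_bar) and V the OLS covariance of (alpha, beta).
    variance = sigma2 * (1.0 / n + (a_bar - a_s) ** 2 / sxx)
    return alpha + beta * a_bar, variance

def greg_index_se(base, current, a_bar):
    """Index (3.10) and a delta-method standard error (samples independent)."""
    den, var_den = level_and_variance(*base, a_bar)
    num, var_num = level_and_variance(*current, a_bar)
    index = num / den
    se = index * sqrt(var_num / num ** 2 + var_den / den ** 2)
    return index, se

# Hypothetical toy data: (base period appraisals, selling prices) per sample.
base = ([100.0, 200.0, 300.0], [110.0, 205.0, 330.0])
current = ([150.0, 250.0, 350.0], [185.0, 285.0, 400.0])
a_bar = 225.0                                   # population mean appraisal

index, se = greg_index_se(base, current, a_bar)
print(index, se)
```

A poorer fit raises the residual variance and hence the standard error, which is the dependence on $R^2$ noted above.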

The latter point brings us to the third property of the
GREG index, namely its dependence on the *quality
of the appraisal data*. For two reasons at least the appraisals may not
exactly represent the transaction prices during the base period so that the
model fit is not perfect $({R}^{2}<1).$ The assessment authorities may not have (real
time) access to the actual sale prices and therefore have to make their own
judgements based on other information. But even if they knew the selling
prices, the authorities may still decide to make adjustments when determining
the property values. It can be argued that selling prices do not always properly measure the unknown market values (which can be seen as a latent variable) and tend to be more volatile. In this respect, Francke (2010) and others have used the term transaction noise.

The way in which
the appraisals have been determined will affect the standard error of the GREG
index. As long as the quality of the appraisal data is the same for all houses
in stock, no bias arises since the appraisals only serve as an auxiliary
variable in regressions run on the samples ${S}^{0}$ and ${S}^{t}$ of properties sold in periods 0 and $t(t=1,\dots ,T).$ However, in general we expect the quality of
the appraisals to be higher for properties belonging to the appraisal reference
(base) period sample ${S}^{0},$ although this will most likely differ across
valuation methods. In the
Netherlands the properties are assessed for tax purposes, both for income tax
and local taxes. The municipalities are responsible for the valuations. Several
municipalities value the houses which are sold during the reference period
(January) by the selling price. Houses which were not sold are sometimes valued
by comparing them to similar traded houses. Some municipalities apparently use
a form of hedonic regression to value the houses, but the methodology is
unfortunately not made publicly available. For more information on the Dutch
appraisal system, see de Vries *et al*.
(2009).

So far we have assumed that the quality of the
individual houses stays the same over time. This is a strong assumption. Thus,
the fourth property (and most important drawback) of the GREG method is that the resulting price
index suffers from *quality change bias*
since explicit quality adjustments are not carried out. The same drawback holds
true for the SPAR method and for the standard repeat sales method. In
principle, hedonic regression methods can deal with the quality change problem,
although it may prove difficult to control for all relevant price determining
characteristics, in particular micro location. The SPAR method automatically
controls for micro location, provided of course that the appraisals sufficiently
account for this, as it is based on the matched-model methodology where the
matching is done at the address level.

## 3.3 Alternative GREG estimators

Statistics Netherlands not only computes house price
indexes for the whole country but also for segments of the housing market,
according to type of house (family dwellings and apartments) and region
(provinces and large cities), mainly because of user needs. Another motivation
behind stratifying the sample can be to mitigate the effect of *sample selection bias*. This type of bias
may arise if the set of houses sold in a particular period is not a random
selection from the housing stock. The nationwide index should then be
indirectly computed as a weighted average of the stratum indexes instead of
directly from all observations.

Suppose the total housing stock ${U}^{0}$ is sub-divided into $K$ non-overlapping strata ${U}_{k}^{0}$ of size ${N}_{k}^{0}\left({\displaystyle {\sum}_{k=1}^{K}{N}_{k}^{0}={N}^{0}}\right).$ The target price index (2.3) can now be rewritten as

$${P}^{0t}=\frac{{\displaystyle \sum _{n\in {U}^{0}}{p}_{n}^{t}}}{{\displaystyle \sum _{n\in {U}^{0}}{p}_{n}^{0}}}=\frac{{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{n\in {U}_{k}^{0}}{p}_{n}^{t}}}}{{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{n\in {U}_{k}^{0}}{p}_{n}^{0}}}}={\displaystyle \sum _{k=1}^{K}{s}_{k}^{0}{P}_{k}^{0t}},\text{}\text{}\text{}\left(3.12\right)$$

where ${P}_{k}^{0t}={\displaystyle {\sum}_{n\in {U}_{k}^{0}}{p}_{n}^{t}}/{\displaystyle {\sum}_{n\in {U}_{k}^{0}}{p}_{n}^{0}}$ is the target price index for stratum ${U}_{k}^{0}(k=1,\dots ,K).$ The base period stock value shares ${s}_{k}^{0}={\displaystyle {\sum}_{n\in {U}_{k}^{0}}{p}_{n}^{0}}/{\displaystyle {\sum}_{n\in {U}^{0}}{p}_{n}^{0}},$ which serve as weights for the stratum
indexes, are unknown and have to be estimated. Assuming the variables that
define the strata are known for all $n\in {U}^{0},$ a natural choice for the weights would be the
appraisal shares ${\widehat{s}}_{k}^{0}={\displaystyle {\sum}_{n\in {U}_{k}^{0}}{a}_{n}^{0}}/{\displaystyle {\sum}_{n\in {U}^{0}}{a}_{n}^{0}}=\left({N}_{k}^{0}/{N}^{0}\right)\left({\overline{a}}_{k}^{0}/{\overline{a}}^{0}\right).$ Obviously, the stratum-defining housing variables should be included in the
appraisal data set. In the

Statistical techniques such as GREG estimation are typically applied to estimate totals or means for small domains for which the number of observations is so small that the standard errors using traditional (Horvitz-Thompson) estimators (in our case the ratio of sample means) would become unacceptably high. It should be mentioned that, even with the GREG method, the stratification scheme should not be too detailed since that might unduly raise the variance of the stratum indexes and hence of the aggregate index. More importantly perhaps, small sample bias will increase and may become non-negligible with very small samples.

OLS regressions of selling prices on appraisals should now be run in every time period for each stratum in order to compute the aggregate GREG index. The stratified (OLS) GREG index is

$${\widehat{P}}_{\text{StrGREG}}^{0t}={\displaystyle \sum _{k=1}^{K}{\widehat{s}}_{k}^{0}{\widehat{P}}_{k,\text{GREG,OLS}}^{0t}}={\displaystyle \sum _{k=1}^{K}{\widehat{s}}_{k}^{0}\left(\frac{{\widehat{\alpha}}_{k}^{t}/{\overline{a}}_{k}^{0}+{\widehat{\beta}}_{k}^{t}}{{\widehat{\alpha}}_{k}^{0}/{\overline{a}}_{k}^{0}+{\widehat{\beta}}_{k}^{0}}\right)}.\text{}\text{}\text{}\left(3.13\right)$$

Differences in the slope coefficients ${\widehat{\beta}}_{k}^{s}(s=0,t)$ across the strata could be the result of sampling error or reflect a real phenomenon. The latter can be of particular importance for periods $t$ which are very distant from period 0 as different housing market segments tend to show varying price trends. Whether any differences in the slope coefficients reflect a real phenomenon could be tested.
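The stratified index (3.13), with appraisal shares as weights, could be computed along the following lines. This is a sketch only: the stratum names, the data layout and the function names are hypothetical, and the toy prices are noise-free so that the coefficients are easy to verify by hand.

```python
from statistics import mean

def ols_fit(appraisals, prices):
    """Least squares fit of price = alpha + beta * appraisal."""
    a_bar, p_bar = mean(appraisals), mean(prices)
    sxx = sum((a - a_bar) ** 2 for a in appraisals)
    sxy = sum((a - a_bar) * (p - p_bar) for a, p in zip(appraisals, prices))
    beta = sxy / sxx
    return p_bar - beta * a_bar, beta

def stratum_index(base, current, a_bar_k):
    """OLS GREG index for one stratum: the term in brackets in (3.13)."""
    alpha0, beta0 = ols_fit(*base)
    alpha_t, beta_t = ols_fit(*current)
    return (alpha_t / a_bar_k + beta_t) / (alpha0 / a_bar_k + beta0)

def stratified_greg_index(strata):
    """Weighted sum of stratum indexes, with appraisal shares as weights."""
    total = sum(sum(s["population_appraisals"]) for s in strata.values())
    index = 0.0
    for s in strata.values():
        share = sum(s["population_appraisals"]) / total   # appraisal share
        a_bar_k = mean(s["population_appraisals"])
        index += share * stratum_index(s["base"], s["current"], a_bar_k)
    return index

# Hypothetical toy data for two strata; samples are (appraisals, prices).
strata = {
    "houses": {
        "population_appraisals": [200.0, 250.0, 300.0, 350.0],
        "base": ([200.0, 300.0, 350.0], [210.0, 310.0, 360.0]),
        "current": ([250.0, 300.0, 350.0], [280.0, 335.0, 390.0]),
    },
    "apartments": {
        "population_appraisals": [100.0, 120.0, 140.0, 160.0],
        "base": ([100.0, 140.0, 160.0], [100.0, 140.0, 160.0]),
        "current": ([100.0, 120.0, 160.0], [116.0, 140.0, 188.0]),
    },
}

str_index = stratified_greg_index(strata)
print(str_index)
```

Each stratum requires its own pair of regressions, which is why very fine stratification quickly leads to small samples per regression.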

An alternative model, to be estimated on the entire data set, is one with a single intercept term, but where the $\beta$'s are allowed to differ across the strata. Let ${D}_{n,k}$ be a dummy variable that has the value 1 if property $n$ belongs to stratum $k$ and 0 otherwise. In period $s(s=0,t)$ the model

$${p}_{n}^{s}={\alpha}^{s}+{\displaystyle \sum _{k=1}^{K}{\beta}_{k}^{s}{D}_{n,k}{a}_{n}^{0}}+{\epsilon}_{n}^{s}\text{}\text{}\left(3.14\right)$$

is estimated by OLS regression on the data of the sample ${S}^{s},$ yielding predicted prices ${\tilde{p}}_{n}^{s}={\tilde{\alpha}}^{s}+{\tilde{\beta}}_{k}^{s}{a}_{n}^{0}$ for $n\in {U}_{k}^{0}.$ The residuals again sum to zero and the new (unstratified) OLS GREG index becomes

$${\tilde{P}}_{\text{GREG,OLS}}^{0t}=\frac{{\displaystyle \sum _{n\in {U}^{0}}{\tilde{p}}_{n}^{t}/{N}^{0}}}{{\displaystyle \sum _{n\in {U}^{0}}{\tilde{p}}_{n}^{0}/{N}^{0}}}=\frac{{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{n\in {U}_{k}^{0}}{\tilde{p}}_{n}^{t}/{N}^{0}}}}{{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{n\in {U}_{k}^{0}}{\tilde{p}}_{n}^{0}/{N}^{0}}}}=\frac{{\tilde{\alpha}}^{t}+{\displaystyle \sum _{k=1}^{K}\left(\frac{{N}_{k}^{0}}{{N}^{0}}\right){\tilde{\beta}}_{k}^{t}{\overline{a}}_{k}^{0}}}{{\tilde{\alpha}}^{0}+{\displaystyle \sum _{k=1}^{K}\left(\frac{{N}_{k}^{0}}{{N}^{0}}\right){\tilde{\beta}}_{k}^{0}{\overline{a}}_{k}^{0}}}.\text{}\text{}\text{}\left(3.15\right)$$

Model (3.14) is more flexible than the original model
given by equations (3.1) and (3.5), and could be useful if the proportionality
between sale prices and appraisals fails. Estimator (3.15) reduces to the
original GREG index (3.10) if the ${\tilde{\beta}}_{k}^{s}$'s are all equal. In practice this will not
happen, and (3.15) and (3.10) will give different answers. A common
justification for the use of GREG estimators is that, being asymptotically
unbiased, they are relatively *robust to
model choice*. So we would expect the impact of the alternative model
specification (3.15) to be moderate. On the other hand, it is well recognized
in the literature that model dependence can be an issue under specific
circumstances, notably when dealing with highly variable and outlier-prone
populations. For example, Hedlin, Falvey, Chambers and Kokic (2001)
stress the importance of a careful model specification search while Beaumont and Alavi
(2004) focus on the treatment of outliers. It would therefore be
worthwhile examining the effect of this alternative model specification.
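A comparison of this kind could start from a sketch such as the one below, which fits model (3.14) by solving the OLS normal equations and evaluates index (3.15). Everything here is illustrative: the function names, the data layout and the noise-free toy data are hypothetical, and the small Gauss-Jordan solver merely stands in for a proper regression routine.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_common_intercept(sample, strata):
    """OLS fit of model (3.14); sample rows are (stratum, appraisal, price)."""
    # Regressors: a constant and, per stratum, dummy times appraisal.
    X = [[1.0] + [a if s == k else 0.0 for k in strata] for s, a, _ in sample]
    y = [p for _, _, p in sample]
    m = len(strata) + 1
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(m)]
    coef = solve(XtX, Xty)
    return coef[0], dict(zip(strata, coef[1:]))

def mean_prediction(alpha, betas, population):
    """Mean predicted price over the stock: numerator or denominator of (3.15)."""
    N = sum(len(v) for v in population.values())
    return alpha + sum(betas[k] * sum(v) for k, v in population.items()) / N

# Hypothetical toy data; appraisals always refer to the base period.
population = {"houses": [200.0, 250.0, 300.0, 350.0],
              "apartments": [100.0, 120.0, 140.0, 160.0]}
strata = list(population)
base_sample = [("houses", 200.0, 210.0), ("houses", 300.0, 310.0),
               ("apartments", 100.0, 100.0), ("apartments", 150.0, 145.0)]
current_sample = [("houses", 250.0, 280.0), ("houses", 350.0, 390.0),
                  ("apartments", 100.0, 125.0), ("apartments", 160.0, 197.0)]

alpha0, betas0 = fit_common_intercept(base_sample, strata)
alpha_t, betas_t = fit_common_intercept(current_sample, strata)
alt_index = (mean_prediction(alpha_t, betas_t, population)
             / mean_prediction(alpha0, betas0, population))
print(alt_index)
```

Running the same data through the single-slope index (3.10) and comparing the two results would give a first impression of how sensitive the index is to the model specification.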
