# 3 Generalized regression estimation

Jan de Haan and Rens Hendriks

## 3.1  A simple GREG method

In this section we will outline an alternative approach to measuring house price change that makes use of appraisal data. The appraisals now serve as auxiliary information in a generalized regression (GREG) framework. Consider the following simple two-variable linear regression model:

$$p_{n}^{0}=\alpha^{0}+\beta^{0}a_{n}^{0}+\epsilon_{n}^{0}, \tag{3.1}$$

where ${\epsilon }_{n}^{0}$ is the error term. Unlike hedonic regression models, which postulate a causal relation between the selling price ${p}_{n}^{0}$ and a set of characteristics relating to the structure and the location of the housing units, this model does not say anything about how house prices are generated; equation (3.1) is merely a descriptive model.

Estimating model (3.1) by least squares regression on the data of sample ${S}^{0}$ yields predicted prices

$$\hat{p}_{n}^{0}=\hat{\alpha}^{0}+\hat{\beta}^{0}a_{n}^{0}. \tag{3.2}$$

The regression residuals for $n\in {S}^{0}$ are ${e}_{n}^{0}={p}_{n}^{0}-{\stackrel{^}{p}}_{n}^{0}.$ Assuming random sampling, as before, we can write the Horvitz-Thompson estimator ${\sum }_{n\in {S}^{0}}{p}_{n}^{0}/{n}^{0}$ of the mean value ${\sum }_{n\in {U}^{0}}{p}_{n}^{0}/{N}^{0}$ as

$$\frac{1}{n^{0}}\sum_{n\in S^{0}}p_{n}^{0}=\hat{\alpha}^{0}+\hat{\beta}^{0}\,\frac{1}{n^{0}}\sum_{n\in S^{0}}a_{n}^{0}+\frac{1}{n^{0}}\sum_{n\in S^{0}}e_{n}^{0}. \tag{3.3}$$

Replacing the sample average of appraisals, ${\sum }_{n\in {S}^{0}}{a}_{n}^{0}/{n}^{0},$ by its population counterpart ${\sum }_{n\in {U}^{0}}{a}_{n}^{0}/{N}^{0}$ yields the generalized regression (GREG) estimator:

$$\hat{\bar{p}}_{\text{GREG}}^{0}=\hat{\alpha}^{0}+\hat{\beta}^{0}\,\frac{1}{N^{0}}\sum_{n\in U^{0}}a_{n}^{0}+\frac{1}{n^{0}}\sum_{n\in S^{0}}e_{n}^{0}. \tag{3.4}$$

Model-assisted sampling theory shows that GREG estimators are asymptotically design unbiased (Särndal et al. 1992), irrespective of the choice of regressors. Unless the sample is small, the bias can be neglected. Provided the appraisals are well correlated with the selling prices, the GREG estimator (3.4) can be expected to be more efficient, in the sense that it has a lower variance, than the Horvitz-Thompson estimator (3.3). As a result, the GREG estimator will usually outperform the Horvitz-Thompson estimator in terms of the mean square error (the sum of the variance and the squared bias).
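The difference between the two estimators can be illustrated with a minimal sketch. The function name `greg_mean` and the toy numbers are hypothetical and only serve as an illustration; the code assumes simple random sampling and an OLS fit with an intercept.

```python
import numpy as np

def greg_mean(prices, appraisals, appraisal_pop_mean):
    """GREG estimate of the mean price, following the structure of (3.4).

    prices, appraisals: values for the sampled (sold) houses.
    appraisal_pop_mean: mean appraisal over the whole housing stock.
    """
    prices = np.asarray(prices, dtype=float)
    appraisals = np.asarray(appraisals, dtype=float)
    # OLS with intercept; for deg=1 polyfit returns [slope, intercept].
    beta_hat, alpha_hat = np.polyfit(appraisals, prices, deg=1)
    residuals = prices - (alpha_hat + beta_hat * appraisals)
    # Under OLS the residual mean is zero; it is kept to mirror (3.4).
    return alpha_hat + beta_hat * appraisal_pop_mean + residuals.mean()

# Hypothetical toy sample in which sold houses are dearer than the stock average.
appraisals_s = [100.0, 200.0, 300.0, 400.0]   # sample mean appraisal: 250
prices_s = [120.0, 230.0, 340.0, 450.0]       # exactly 10 + 1.1 * appraisal
ht = float(np.mean(prices_s))                 # Horvitz-Thompson (sample mean)
greg = greg_mean(prices_s, appraisals_s, appraisal_pop_mean=230.0)
print(ht, greg)
```

In this toy case the sample mean is 285, while the GREG estimate is pulled down to roughly 263 because the population mean appraisal (230) lies below the sample mean appraisal (250).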

The same procedure can be applied to the comparison period $t.$ After estimating the model

$$p_{n}^{t}=\alpha^{t}+\beta^{t}a_{n}^{0}+\epsilon_{n}^{t} \tag{3.5}$$

through least squares regression on the data of the current period sample ${S}^{t},$ we obtain predicted prices

$$\hat{p}_{n}^{t}=\hat{\alpha}^{t}+\hat{\beta}^{t}a_{n}^{0}, \tag{3.6}$$

which lead to the GREG estimator of the mean value of the housing stock in period $t:$

$$\hat{\bar{p}}_{\text{GREG}}^{t}=\hat{\alpha}^{t}+\hat{\beta}^{t}\,\frac{1}{N^{t}}\sum_{n\in U^{t}}a_{n}^{0}+\frac{1}{n^{t}}\sum_{n\in S^{t}}e_{n}^{t}, \tag{3.7}$$

where ${e}_{n}^{t}={p}_{n}^{t}-{\stackrel{^}{p}}_{n}^{t}$ denote the period $t$ regression residuals. For a fixed housing stock we have ${U}^{t}={U}^{0},$ hence ${\sum }_{n\in {U}^{t}}{a}_{n}^{0}/{N}^{t}={\sum }_{n\in {U}^{0}}{a}_{n}^{0}/{N}^{0},$ and it follows that

$$\hat{\bar{p}}_{\text{GREG}}^{t}=\hat{\alpha}^{t}+\hat{\beta}^{t}\,\frac{1}{N^{0}}\sum_{n\in U^{0}}a_{n}^{0}+\frac{1}{n^{t}}\sum_{n\in S^{t}}e_{n}^{t}. \tag{3.8}$$

The GREG estimator of house price change results simply from taking the ratio of equations (3.8) and (3.4):

$$\hat{P}_{\text{GREG}}^{0t}=\frac{\hat{\alpha}^{t}+\hat{\beta}^{t}\bar{a}^{0}+\sum_{n\in S^{t}}e_{n}^{t}/n^{t}}{\hat{\alpha}^{0}+\hat{\beta}^{0}\bar{a}^{0}+\sum_{n\in S^{0}}e_{n}^{0}/n^{0}}, \tag{3.9}$$

where $\bar{a}^{0}=\sum_{n\in U^{0}}a_{n}^{0}/N^{0}$. Some additional small-sample bias will be introduced by the non-linear (ratio) structure. When Ordinary Least Squares (OLS) regression is used to estimate models (3.1) and (3.5), the unweighted sample means of the regression residuals in (3.9), $\sum_{n\in S^{0}}e_{n}^{0}/n^{0}$ and $\sum_{n\in S^{t}}e_{n}^{t}/n^{t}$, are equal to 0 and the GREG index reduces to

$$\hat{P}_{\text{GREG,OLS}}^{0t}=\frac{\sum_{n\in U^{0}}\hat{p}_{n}^{t}}{\sum_{n\in U^{0}}\hat{p}_{n}^{0}}=\frac{\hat{\alpha}^{t}+\hat{\beta}^{t}\bar{a}^{0}}{\hat{\alpha}^{0}+\hat{\beta}^{0}\bar{a}^{0}}. \tag{3.10}$$

As the first expression on the right-hand side of (3.10) indicates, the (OLS) GREG approach essentially imputes prices pertaining to the base period and the current period using equations (3.2) and (3.6). The difference with the hedonic double imputation method is twofold. First, a descriptive model, not a hedonic one, is used to estimate predicted prices, so that we cannot speak of unbiased predicted prices. Second, prices are imputed for all houses of the housing stock instead of only the subset of sampled houses.
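The OLS version of the index can be sketched in a few lines. This is only an illustration under the assumptions of this section (a fixed housing stock and a known population mean of base period appraisals); the function names `ols_coefficients` and `greg_index` and the toy data are hypothetical.

```python
import numpy as np

def ols_coefficients(prices, appraisals):
    """Return (alpha_hat, beta_hat) from an OLS regression of prices on appraisals."""
    beta_hat, alpha_hat = np.polyfit(
        np.asarray(appraisals, dtype=float), np.asarray(prices, dtype=float), deg=1
    )
    return alpha_hat, beta_hat

def greg_index(prices_0, appraisals_s0, prices_t, appraisals_st, appraisal_pop_mean):
    """OLS GREG index: ratio of fitted values at the population mean appraisal."""
    alpha_0, beta_0 = ols_coefficients(prices_0, appraisals_s0)
    alpha_t, beta_t = ols_coefficients(prices_t, appraisals_st)
    numerator = alpha_t + beta_t * appraisal_pop_mean
    denominator = alpha_0 + beta_0 * appraisal_pop_mean
    return numerator / denominator

# Hypothetical data: different houses are sold in period 0 and in period t,
# but all appraisals refer to the base period.
index = greg_index(
    prices_0=[100.0, 200.0, 300.0], appraisals_s0=[100.0, 200.0, 300.0],
    prices_t=[185.0, 305.0, 485.0], appraisals_st=[150.0, 250.0, 400.0],
    appraisal_pop_mean=200.0,
)
print(index)
```

The two samples need not overlap: each regression uses only the houses sold in its own period, and the population mean appraisal ties the two fitted lines to the same housing stock.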

## 3.2  Properties of the GREG index

The (OLS) GREG index has several properties worth mentioning. First, the computation of the GREG index is very simple. Once the population mean of appraisals ${\overline{a}}^{0}$ and the base period regression coefficients ${\stackrel{^}{\alpha }}^{0}$ and ${\stackrel{^}{\beta }}^{0}$ have been calculated, all that is needed is running a regression each month of selling prices against appraisals and plugging the coefficients ${\stackrel{^}{\alpha }}^{t}$ and ${\stackrel{^}{\beta }}^{t}$ into (3.10). Note that the GREG index can be written as a pseudo chain index:

$$\hat{P}_{\text{GREG,OLS}}^{0t}=\prod_{\tau=1}^{t}\frac{\hat{\alpha}^{\tau}/\bar{a}^{0}+\hat{\beta}^{\tau}}{\hat{\alpha}^{\tau-1}/\bar{a}^{0}+\hat{\beta}^{\tau-1}}. \tag{3.11}$$

This can be helpful in practice, particularly when new appraisal data becomes available. New appraisal data often reaches the statistical agency with a considerable time lag, sometimes of more than a year. There are two reasons for using the latest appraisal information. One is that the quality of the appraisals may improve over time, which seems to have been the case in the Netherlands (de Vries et al. 2009). The other is that the assumption of a fixed housing stock can be relaxed so that newly-built properties can be incorporated through chaining; the resulting chained GREG index takes the dynamics of the housing stock into account. The same advantages of chaining apply to the SPAR method. Suppose new appraisals, relating to period $T$ $(0<T\le t)$, are available in period $t+1$. The time series can then be updated through chain-linking, i.e., by multiplying $\hat{P}_{\text{GREG,OLS}}^{0t}$ by the month-to-month change $\left(\tilde{\alpha}^{t+1}/\bar{a}^{T}+\tilde{\beta}^{t+1}\right)/\left(\tilde{\alpha}^{t}/\bar{a}^{T}+\tilde{\beta}^{t}\right)$, where the coefficients now pertain to regressions of selling prices on the period $T$ appraisals.
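The chaining logic can be sketched as follows. The function names `pseudo_chain_index` and `chain_link` and all coefficient values are hypothetical; the sketch assumes that the regression coefficients for each period have already been estimated.

```python
def pseudo_chain_index(coefficients, appraisal_pop_mean):
    """Chain the month-to-month changes of alpha/a_bar + beta.

    coefficients: list of (alpha_hat, beta_hat) pairs for periods 0, 1, ..., t,
    all from regressions on the same set of appraisals.
    """
    scaled = [alpha / appraisal_pop_mean + beta for alpha, beta in coefficients]
    index = 1.0
    for previous, current in zip(scaled, scaled[1:]):
        index *= current / previous
    return index

def chain_link(index_0t, coef_t, coef_t1, new_appraisal_pop_mean):
    """Extend the series to period t+1 using regressions on new appraisals.

    coef_t, coef_t1: (alpha, beta) for periods t and t+1, both estimated by
    regressing selling prices on the new (period T) appraisals.
    """
    (alpha_t, beta_t), (alpha_t1, beta_t1) = coef_t, coef_t1
    link = (alpha_t1 / new_appraisal_pop_mean + beta_t1) / (
        alpha_t / new_appraisal_pop_mean + beta_t
    )
    return index_0t * link

# Hypothetical coefficients for periods 0, 1 and 2 on base period appraisals.
index_02 = pseudo_chain_index([(0.0, 1.00), (2.0, 1.02), (5.0, 1.03)], 200.0)
# Hypothetical coefficients for periods 2 and 3 on the new appraisals.
index_03 = chain_link(index_02, (0.0, 1.00), (0.0, 1.01), 220.0)
print(index_02, index_03)
```

Because the month-to-month factors telescope, `pseudo_chain_index` returns the same value as the direct index as long as the appraisals stay the same; only `chain_link` changes the appraisal base.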

Second, standard errors of the GREG index can be estimated rather easily using the variance-covariance matrix of the regression coefficients, which is standard output of most statistical packages. An expression for the approximate standard error is derived in the Appendix. The standard error of the GREG index depends on the goodness of fit $\left({R}^{2}\right)$ of the regression model. It is most likely that ${R}^{2}$ for the base period regression is higher than that for the current period regressions. This is because we expect to find a strong linear relation between appraisals and sale prices in the appraisal reference period while in later periods this relation will probably be weaker due to differing price trends across different types of houses or regions. The derivation of approximate standard errors for the SPAR index is a bit more complex because there is an additional source of sampling error, namely the sampling variability of the mean appraisals; see de Haan (2007).
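As a rough illustration of how the variance-covariance matrices of the two regressions can feed into a standard error, the sketch below applies a first-order (delta method) approximation to the ratio. This is not the expression from the Appendix, which is not reproduced here; it is a generic approximation that assumes the two samples are independent and treats the population mean appraisal as a known constant. The function names and toy data are hypothetical.

```python
import numpy as np

def ols_with_cov(prices, appraisals):
    """OLS coefficients (alpha, beta) and their estimated covariance matrix."""
    y = np.asarray(prices, dtype=float)
    X = np.column_stack([np.ones_like(y), np.asarray(appraisals, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    sigma2 = residuals @ residuals / (len(y) - 2)   # needs more than 2 sales
    return coef, sigma2 * np.linalg.inv(X.T @ X)

def greg_index_se(prices_0, appr_s0, prices_t, appr_st, appraisal_pop_mean):
    """Index and delta-method standard error (assumes independent samples)."""
    c = np.array([1.0, appraisal_pop_mean])
    coef_0, cov_0 = ols_with_cov(prices_0, appr_s0)
    coef_t, cov_t = ols_with_cov(prices_t, appr_st)
    denominator, numerator = c @ coef_0, c @ coef_t
    index = numerator / denominator
    # Relative variance of a ratio of two independent estimates.
    rel_var = (c @ cov_t @ c) / numerator**2 + (c @ cov_0 @ c) / denominator**2
    return index, index * np.sqrt(rel_var)

# Hypothetical noisy samples for the base and the current period.
index, se = greg_index_se(
    [105.0, 195.0, 310.0, 390.0], [100.0, 200.0, 300.0, 400.0],
    [170.0, 260.0, 380.0, 470.0], [150.0, 250.0, 350.0, 450.0],
    appraisal_pop_mean=250.0,
)
print(index, se)
```

The better the fit of either regression, the smaller its contribution to the approximate variance, which matches the dependence on $R^{2}$ described above.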

The latter point brings us to the third property of the GREG index, namely its dependence on the quality of the appraisal data. For at least two reasons the appraisals may not exactly represent the transaction prices during the base period, so that the model fit is not perfect $\left(R^{2}<1\right)$. The assessment authorities may not have (real time) access to the actual sale prices and therefore have to make their own judgements based on other information. But even if they knew the selling prices, the authorities may still decide to make adjustments when determining the property values. It can be argued that selling prices do not always properly measure the unknown market values, which can be seen as a latent variable, and tend to be more volatile. In this respect, Francke (2010) and others have used the term transaction noise.

The way in which the appraisals have been determined will affect the standard error of the GREG index. As long as the quality of the appraisal data is the same for all houses in stock, no bias arises since the appraisals only serve as an auxiliary variable in regressions run on the samples ${S}^{0}$ and ${S}^{t}$ of properties sold in periods 0 and $t\left(t=1,\dots ,T\right).$ However, in general we expect the quality of the appraisals to be higher for properties belonging to the appraisal reference (base) period sample ${S}^{0},$ although this will most likely differ across valuation methods. In the Netherlands the properties are assessed for tax purposes, both for income tax and local taxes. The municipalities are responsible for the valuations. Several municipalities value the houses which are sold during the reference period (January) by the selling price. Houses which were not sold are sometimes valued by comparing them to similar traded houses. Some municipalities apparently use a form of hedonic regression to value the houses, but the methodology is unfortunately not made publicly available. For more information on the Dutch appraisal system, see de Vries et al. (2009).

So far we have assumed that the quality of the individual houses stays the same over time. This is a strong assumption. Thus, the fourth property of the GREG method, and its most important drawback, is that the resulting price index suffers from quality change bias since explicit quality adjustments are not carried out. The same drawback holds true for the SPAR method and for the standard repeat sales method. In principle, hedonic regression methods can deal with the quality change problem, although it may prove difficult to control for all relevant price-determining characteristics, in particular micro location. The SPAR method automatically controls for micro location, provided of course that the appraisals sufficiently account for this, as it is based on the matched-model methodology where the matching is done at the address level.

## 3.3  Alternative GREG estimators

Statistics Netherlands not only computes house price indexes for the whole country but also for segments of the housing market, according to type of house (family dwellings and apartments) and region (provinces and large cities), mainly because of user needs. Another motivation behind stratifying the sample can be to mitigate the effect of sample selection bias. This type of bias may arise if the set of houses sold in a particular period is not a random selection from the housing stock. The nationwide index should then be indirectly computed as a weighted average of the stratum indexes instead of directly from all observations.

Suppose the total housing stock ${U}^{0}$ is sub-divided into $K$ non-overlapping strata ${U}_{k}^{0}$ of size ${N}_{k}^{0}\left({\sum }_{k=1}^{K}{N}_{k}^{0}={N}^{0}\right).$ The target price index (2.3) can now be rewritten as

$$P^{0t}=\sum_{k=1}^{K}s_{k}^{0}P_{k}^{0t}, \tag{3.12}$$

where ${P}_{k}^{0t}={\sum }_{n\in {U}_{k}^{0}}{p}_{n}^{t}/{\sum }_{n\in {U}_{k}^{0}}{p}_{n}^{0}$ is the target price index for stratum ${U}_{k}^{0}\left(k=1,\dots ,K\right).$ The base period stock value shares ${s}_{k}^{0}={\sum }_{n\in {U}_{k}^{0}}{p}_{n}^{0}/{\sum }_{n\in {U}^{0}}{p}_{n}^{0},$ which serve as weights for the stratum indexes, are unknown and have to be estimated. Assuming the variables that define the strata are known for all $n\in {U}^{0},$ a natural choice for the weights would be the appraisal shares ${\stackrel{^}{s}}_{k}^{0}={\sum }_{n\in {U}_{k}^{0}}{a}_{n}^{0}/{\sum }_{n\in {U}^{0}}{a}_{n}^{0}=\left({N}_{k}^{0}/{N}^{0}\right)\left({\overline{a}}_{k}^{0}/{\overline{a}}^{0}\right).$ Obviously, the stratum-defining housing variables should be included in the appraisal data set. In the Netherlands address and type of dwelling are included. This allows a sub-division of the population into cross classifications of location and type of dwelling. Appraisals may not always be accurate estimates of the 'true' market values of the individual properties but at the stratum level we expect the accuracy of the average appraisals to be sufficient for the computation of the weights.
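The appraisal shares are straightforward to compute once every property in the stock carries a stratum label. A minimal sketch, with a hypothetical function name `appraisal_shares` and made-up appraisals and labels:

```python
import numpy as np

def appraisal_shares(appraisals, strata):
    """Share of each stratum in the total appraised value of the stock."""
    appraisals = np.asarray(appraisals, dtype=float)
    strata = np.asarray(strata)
    total = appraisals.sum()
    return {
        str(k): float(appraisals[strata == k].sum() / total)
        for k in np.unique(strata)
    }

# Hypothetical stock of five properties in two strata.
shares = appraisal_shares(
    [100.0, 100.0, 200.0, 200.0, 400.0],
    ["apartment", "apartment", "family", "family", "family"],
)
print(shares)
```

By construction the shares sum to one, so they can be used directly as weights for the stratum indexes.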

Statistical techniques such as GREG estimation are typically applied to estimate totals or means for small domains for which the number of observations is so small that the standard errors using traditional (Horvitz-Thompson) estimators, in our case the ratio of sample means, would become unacceptably high. It should be mentioned that, even with the GREG method, the stratification scheme should not be too detailed since that might unduly raise the variance of the stratum indexes and hence of the aggregate index. More importantly perhaps, small-sample bias will increase and may become non-negligible with very small samples.

OLS regressions of selling prices on appraisals should now be run in every time period for each stratum in order to compute the aggregate GREG index. The stratified (OLS) GREG index is

$$\hat{P}_{\text{GREG,OLS}}^{0t}=\sum_{k=1}^{K}\hat{s}_{k}^{0}\,\frac{\hat{\alpha}_{k}^{t}+\hat{\beta}_{k}^{t}\bar{a}_{k}^{0}}{\hat{\alpha}_{k}^{0}+\hat{\beta}_{k}^{0}\bar{a}_{k}^{0}}. \tag{3.13}$$

Differences in the slope coefficients ${\stackrel{^}{\beta }}_{k}^{s}\left(s=0,t\right)$ across the strata could be the result of sampling error or reflect a real phenomenon. The latter can be of particular importance for periods $t$ which are very distant from period 0 as different housing market segments tend to show varying price trends. Whether any differences in the slope coefficients reflect a real phenomenon could be tested.
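The stratified computation can be sketched as a weighted average of per-stratum OLS GREG indexes, with appraisal shares as weights. The function names, the dictionary layout and the toy data are hypothetical; the sketch assumes each stratum has enough sales in both periods to fit a line.

```python
import numpy as np

def fitted_mean(prices, appraisals, a_bar):
    """Fitted price at the mean appraisal a_bar from an OLS line."""
    beta, alpha = np.polyfit(
        np.asarray(appraisals, dtype=float), np.asarray(prices, dtype=float), deg=1
    )
    return alpha + beta * a_bar

def stratified_greg_index(samples_0, samples_t, stock):
    """Appraisal-share weighted average of stratum OLS GREG indexes.

    samples_0, samples_t: {stratum: (prices, appraisals)} for the two periods.
    stock: {stratum: (N_k, a_bar_k)}, stock size and mean appraisal per stratum.
    """
    total_appraisal = sum(n_k * a_bar_k for n_k, a_bar_k in stock.values())
    index = 0.0
    for k, (n_k, a_bar_k) in stock.items():
        share = n_k * a_bar_k / total_appraisal          # appraisal share
        numerator = fitted_mean(*samples_t[k], a_bar_k)
        denominator = fitted_mean(*samples_0[k], a_bar_k)
        index += share * numerator / denominator
    return index

# Hypothetical two-stratum example: prices rise 10% in A and 30% in B.
samples_0 = {"A": ([90.0, 100.0, 110.0], [90.0, 100.0, 110.0]),
             "B": ([280.0, 300.0, 320.0], [280.0, 300.0, 320.0])}
samples_t = {"A": ([99.0, 110.0, 132.0], [90.0, 100.0, 120.0]),
             "B": ([364.0, 390.0, 429.0], [280.0, 300.0, 330.0])}
stock = {"A": (3, 100.0), "B": (1, 300.0)}
print(stratified_greg_index(samples_0, samples_t, stock))
```

In this example both strata hold half of the appraised value, so the aggregate index lies midway between the two stratum indexes.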

An alternative model, to be estimated on the entire data set, is one with a single intercept term but with slope coefficients $\beta_{k}$ that are allowed to differ across the strata. Let $D_{n,k}$ be a dummy variable that has the value 1 if property $n$ belongs to stratum $k$ and 0 otherwise. In period $s$ $\left(s=0,t\right)$ the model

$$p_{n}^{s}=\alpha^{s}+\sum_{k=1}^{K}\beta_{k}^{s}D_{n,k}a_{n}^{0}+\epsilon_{n}^{s} \tag{3.14}$$

is estimated by OLS regression on the data of the sample ${S}^{s},$ yielding predicted prices ${\stackrel{˜}{p}}_{n}^{s}={\stackrel{˜}{\alpha }}^{s}+{\stackrel{˜}{\beta }}_{k}^{s}{a}_{n}^{0}$ for $n\in {U}_{k}^{0}.$ The residuals again sum to zero and the new (unstratified) OLS GREG index becomes

$$\tilde{P}_{\text{GREG,OLS}}^{0t}=\frac{\tilde{\alpha}^{t}+\sum_{k=1}^{K}\left(N_{k}^{0}/N^{0}\right)\tilde{\beta}_{k}^{t}\bar{a}_{k}^{0}}{\tilde{\alpha}^{0}+\sum_{k=1}^{K}\left(N_{k}^{0}/N^{0}\right)\tilde{\beta}_{k}^{0}\bar{a}_{k}^{0}}. \tag{3.15}$$

Model (3.14) is more flexible than the original model given by equations (3.1) and (3.5), and could be useful if the proportionality between sale prices and appraisals fails. Estimator (3.15) reduces to the original GREG index (3.10) if the slope estimates $\tilde{\beta}_{k}^{s}$ are all equal. In practice this will not happen, and (3.15) and (3.10) will give different answers. A common justification for the use of GREG estimators is that, being asymptotically unbiased, they are relatively robust to model choice. So we would expect the impact of the alternative model specification (3.14) to be moderate. On the other hand, it is well recognized in the literature that model dependence can be an issue under specific circumstances, notably when dealing with highly variable and outlier-prone populations. For example, Hedlin, Falvey, Chambers and Kokic (2001) stress the importance of a careful model specification search while Beaumont and Alavi (2004) focus on the treatment of outliers. It would therefore be worthwhile examining the effect of this alternative model specification.
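A minimal sketch of the alternative specification is given below: one common intercept, stratum-specific slopes, and an index obtained by averaging the fitted prices over the housing stock. The function names, the data layout and the toy numbers are hypothetical, and the stock sizes and mean appraisals per stratum are assumed to be known.

```python
import numpy as np

def fit_common_intercept(prices, appraisals, strata, labels):
    """OLS with one intercept and a separate appraisal slope per stratum."""
    y = np.asarray(prices, dtype=float)
    a = np.asarray(appraisals, dtype=float)
    strata = np.asarray(strata)
    # Columns: intercept, then D_{n,k} * a_n for each stratum k.
    X = np.column_stack([np.ones_like(y)] + [(strata == k) * a for k in labels])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], dict(zip(labels, coef[1:]))

def alternative_greg_index(sample_0, sample_t, stock):
    """Index from fitted prices averaged over the stock.

    sample_s: (prices, appraisals, strata) for the houses sold in period s.
    stock: {stratum: (N_k, a_bar_k)}.
    """
    labels = list(stock)
    n_total = sum(n_k for n_k, _ in stock.values())

    def stock_mean(sample):
        alpha, betas = fit_common_intercept(*sample, labels)
        return alpha + sum(
            n_k / n_total * betas[k] * a_bar_k for k, (n_k, a_bar_k) in stock.items()
        )

    return stock_mean(sample_t) / stock_mean(sample_0)

# Hypothetical data in which both strata share the same slope in each period.
sample_0 = ([90.0, 110.0, 280.0, 320.0], [90.0, 110.0, 280.0, 320.0],
            ["A", "A", "B", "B"])
sample_t = ([125.0, 149.0, 365.0, 413.0], [100.0, 120.0, 300.0, 340.0],
            ["A", "A", "B", "B"])
stock = {"A": (3, 100.0), "B": (1, 500.0)}   # overall mean appraisal: 200
print(alternative_greg_index(sample_0, sample_t, stock))
```

Because the slopes coincide across strata in this toy example, the result agrees with the original index (3.10) evaluated at the overall mean appraisal of 200, which is about 1.225; with differing slopes the two indexes would diverge.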
