2007 Health Region Peer Groups – User Guide
The purpose of this document is to define the concept of peer groups, to give an overview of how they are created and to demonstrate their usefulness. This paper presents the 2007 classification of the peer groups and compares the results with the peer groups created in both 2003 and 2000. More detailed technical information on the formation of the peer groups can be found in the working papers Health Region Peer Groups and Health Region Peer Groups 2003 written by the Health Statistics Division of Statistics Canada.
The launch of the Canadian Community Health Survey (CCHS) in 2000, along with the expansion of existing data products at the health region level, led to the desire for a method of comparing regions with similar socio-economic determinants of health. The reasoning behind the development of such a method is that once the effects of the various social and economic characteristics known to influence health have been removed, it is possible to compare regions by measures of health status. It is also possible to compare the relative effectiveness of health promotion and prevention activities across regions. Thus, the health regions have been placed into groups with similar socio-economic characteristics using a clustering technique, and these groups are referred to as 'peer groups'.
Development of the criteria used to define peer groups required careful consideration of their intended use. The requirement that peer groups be used as a method for comparing health related issues ultimately eliminated all variables directly describing health as potential candidates in the creation of the groups. Further, it was desired that all variables used must be reliable and available for all health regions. As well, the need for objectivity required that peer groups be developed using empirical techniques. Finally, consideration of the need for simplified and relevant comparisons also required that peer groups have approximately 5 to 10 health regions per group and that there be representation across the country within each group. In the application of the above parameters, several limiting factors resulted which required some modifications. All criteria were followed to the extent possible and any deviations are explained in detail throughout this document.
The original 2000 Peer Group Classification was released in 2002, and was based on the 1996 Census information as well as the health region boundaries as defined by the provinces and territories in 2000. In order to remain current with respect to data availability and the health region boundary changes, it is necessary to update the peer group classification over time. These updates have taken place in the form of the 2003 Peer Group Classification and the 2007 Peer Group Classification. The latest update to the peer groups was based on the 2006 Census data and the 2007 health region boundaries. The final result of this classification was the creation of ten peer groups, representing all health regions across Canada.
This document will give an overview of how the peer groups are created. The 2007 Peer Group Classification is presented and the results are compared with the peer groups created in the past. Finally, the use of the peer groups in the analysis of health related issues will be demonstrated through an example. More detailed technical information on the formation of the peer groups can be found in the working papers Health Region Peer Groups and Health Region Peer Groups 2003 written by the Health Statistics Division of Statistics Canada.
Twenty-four variables describing the socio-economic and socio-demographic determinants of health within the 123 health regions across Canada were used to produce the peer groups. The variables chosen for this task cover a wide range of areas including demographic structure, social and economic status, ethnicity, Aboriginal status, housing, urbanization, income inequality and labour market conditions. Note that health-related variables were deliberately not used in the creation of the peer groups. The variables used in the 2007 Peer Group Classification are outlined in Appendix A.
The Census is the main source of information for the creation of the peer groups. Census information is readily available at various levels of geography and covers a broad range of topics. As well, all health regions are effectively covered by the Census, which is one of the main data requirements in the creation of the peer groups. The 2007 Peer Group Classification used data from the 2006 Census.
Two new variables were used in 2007, in place of similar variables that were used in the previous peer group classification. The new variables fall into two separate areas:
The median household income (MedInc) for a health region was used instead of the average household income (AvgInc) for a health region. Median household income is more representative of the economic situation faced by families and communities within the region.
The proportion of the population under 20 (Pop20) has replaced the proportion of the population under 15 (Pop15) that was used in previous peer group classifications. This modification was made to reflect the change in the definition of the dependency ratio, which now measures the ratio of the combined child population aged 0 to 19 and population aged 65 and over to the population 20 to 64 years old. This ratio is presented as the number of dependents for every 100 people in the working age population.
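The dependency ratio described above can be written as a short calculation. The following is a minimal sketch; the function name and the population counts are hypothetical and are not Census figures.

```python
def dependency_ratio(pop_0_19: float, pop_20_64: float, pop_65_plus: float) -> float:
    """Dependents (aged 0 to 19 and 65 and over) per 100 people aged 20 to 64."""
    if pop_20_64 <= 0:
        raise ValueError("working-age population must be positive")
    return 100.0 * (pop_0_19 + pop_65_plus) / pop_20_64


# Hypothetical counts for one region (not Census figures).
ratio = dependency_ratio(pop_0_19=24_000, pop_20_64=60_000, pop_65_plus=15_000)
print(round(ratio, 1))  # 65.0
```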
A non-hierarchical cluster analysis was the method chosen to create the peer groups. Generally speaking, cluster analysis attempts to organize variables or observations into groups using measures of distance. Non-hierarchical algorithms attempt to partition a set of observations into a pre-defined number of disjoint groups using a specified optimization criterion. This approach appeared best suited to meet the original objectives of the peer group project, namely to use an empirical technique to create a pre-defined number of peer groups with approximately 5 to 10 health regions within each group.
The peer groups were created in SAS using the FASTCLUS procedure. This procedure uses a k-means algorithm to assign observations to a pre-defined set of k clusters. A description of k-means clustering and several variants of the method can be found in Anderberg (1973). The basic steps for placing observations into k clusters are as follows:
- Select k observations as cluster seeds (the initial centers of the clusters).
- Assign observations to the nearest cluster seed. After all observations are assigned, cluster seeds are replaced by their respective cluster means. This step is repeated until the change in cluster seeds becomes or approaches zero.
- Form final clusters by assigning each observation to its nearest cluster seed.
Complete details of the FASTCLUS procedure can be found in the SAS OnlineDoc®, Version 9.
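The three steps listed above can be illustrated with a small sketch. This is not the FASTCLUS procedure itself: taking the first k observations as seeds, the stopping tolerance and the toy data are all assumptions made for illustration.

```python
import math


def nearest(point, seeds):
    """Index of the seed closest to point (Euclidean distance)."""
    return min(range(len(seeds)), key=lambda j: math.dist(point, seeds[j]))


def k_means(points, k, tol=1e-9, max_iter=100):
    """Return (labels, seeds) after iterating the assign/update steps."""
    # Step 1: take the first k observations as seeds (a simplifying assumption).
    seeds = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        # Step 2: assign each observation to its nearest seed ...
        labels = [nearest(p, seeds) for p in points]
        new_seeds = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # ... then replace the seed by the mean of its cluster.
                new_seeds.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_seeds.append(seeds[j])  # keep the seed of an empty cluster
        shift = max(math.dist(a, b) for a, b in zip(seeds, new_seeds))
        seeds = new_seeds
        if shift <= tol:  # stop once the seeds no longer move
            break
    # Step 3: form final clusters by assigning each observation to its nearest seed.
    labels = [nearest(p, seeds) for p in points]
    return labels, seeds


# Toy data: two well-separated groups in two dimensions (hypothetical values).
data = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, seeds = k_means(data, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

With real data the result depends on the choice of seeds, which is one reason the number of clusters and the starting points deserve attention.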
3.1 Number of Clusters
One of the major challenges with cluster analysis is selecting the appropriate number of initial clusters. Several criteria have been suggested (Everitt, 1993) which generally involve the optimization of one or more test statistics. From a practical perspective it is generally left up to the analyst to determine the number that best suits a given need. For the purpose of the 2007 Peer Group Classification a maximum of 17 clusters was chosen. This would give an average of about 7 health regions per peer group (see note 1), which is in line with the study objectives. The maximum number of clusters used in 2007 was lower than that used in both 2003 and 2000 since there were fewer health regions in total.
4.1 Standardization of Variables
Variables measured on different scales, or on a common scale with differing variances, are often standardized in order to mitigate the effect of these differences among the variables. For this exercise, all 24 socio-economic variables were standardized (mean 0, variance 1) prior to performing the cluster analysis. Two variables could not be calculated for some of the more remote health regions: the proportion of low income persons in private households (LowPop) and the proportion of low income children (LowKids). This is because the Census does not derive low income data for the three territories and Indian reserves. Other remote areas can also be excluded from low income statistics if the data in that region are considered unreliable. For the two low income variables used in the peer group analysis, a value of zero appears in the file for the three territories as well as for regions 2417 and 2418. This value of zero is an indication that the variable could not be calculated.
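Standardizing a variable to mean 0 and variance 1 can be sketched as follows. The use of the population variance (divisor n) is an assumption, since the divisor used in the original analysis is not stated, and the input values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and variance 1 (population variance, an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(v - mean) / sd for v in values]


# Hypothetical employment rates (%) for five regions.
z = standardize([55.0, 60.0, 65.0, 70.0, 75.0])
print([round(v, 3) for v in z])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```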
4.2 Creation of Peer Groups
To establish a starting point, the clustering algorithm was instructed to group the 123 health regions into 17 clusters. Five of the resulting clusters contained only 1 health region. This indicated that 17 clusters was too many given that the objective of assigning peer groups is to be able to compare like health regions. The cluster analysis was rerun with a reduced number of cluster seeds.
The results of the final cluster analysis using PROC FASTCLUS can be seen in Table 4.2.1. The table shows the number of health regions contained in each peer group, as well as several statistics related to the clusters. The root mean square standard deviation (RMS Std) is a measure of the variability in the data points around the cluster center. The radius displays the largest Euclidean distance from the cluster center to any observation within the cluster. The nearest cluster refers to the closest peer group in terms of Euclidean distance. Finally, the last column of the table displays the distance between the current cluster center and that of its closest neighbour. For each of these statistics, the cluster center is the point having coordinates that are the means of all the observations in the cluster. Euclidean distance is the straight-line distance between two points, here measured across the standardized variables.
Two clusters (A and C) contained the majority of health regions. Each of these clusters was internally homogeneous: both had low standard deviations despite containing a large number of health regions. The two clusters were also each other's nearest neighbours, and the distance between their cluster centers was small, which indicates that the health regions in one cluster were similar to those in the other. Therefore, although these clusters did not meet the objective of having approximately 5 to 10 regions per peer group, there did not appear to be a valid reason to split them into smaller groups.
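The cluster statistics described above can be computed along the following lines. This is a sketch only: the formula used for RMS Std (squared deviations pooled over all variables, with an n - 1 divisor) is an assumption about how the statistic is defined, and the toy clusters are hypothetical.

```python
import math


def centroid(members):
    """Mean of each coordinate across the cluster's observations."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def cluster_summary(clusters):
    """clusters: dict of label -> list of points. Returns dict of label -> statistics."""
    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    summary = {}
    for lab, pts in clusters.items():
        center = centers[lab]
        n, p = len(pts), len(center)
        sq_dev = sum(math.dist(pt, center) ** 2 for pt in pts)
        # Pooled over the p variables; the n - 1 divisor is an assumption.
        rms_std = math.sqrt(sq_dev / (p * (n - 1))) if n > 1 else 0.0
        # Radius: largest distance from the center to any member.
        radius = max(math.dist(pt, center) for pt in pts)
        # Nearest cluster: the other cluster whose center is closest.
        other = min((o for o in centers if o != lab),
                    key=lambda o: math.dist(center, centers[o]))
        summary[lab] = {
            "n": n,
            "rms_std": rms_std,
            "radius": radius,
            "nearest": other,
            "centroid_distance": math.dist(center, centers[other]),
        }
    return summary


# Toy clusters in two standardized dimensions (hypothetical values).
toy = {
    "A": [(0.0, 0.0), (2.0, 0.0)],
    "B": [(4.0, 0.0), (6.0, 0.0)],
    "C": [(1.0, 10.0), (1.0, 12.0)],
}
stats = cluster_summary(toy)
print(stats["A"]["nearest"], stats["A"]["radius"], stats["A"]["centroid_distance"])
# B 1.0 4.0
```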
Note that the total number of health regions in Table 4.2.1 no longer equals 123. This is due to a new development that was added to the 2007 Peer Group Classification. There are two levels of geography in Nova Scotia: there are 6 Zones or 9 District Health Authorities (DHAs). Due to the relationship between the two levels of geography it was possible to incorporate both into the peer group classification. The relationship between the two levels of geography takes on two forms. The first is that there are 3 Zones (1202, 1205 and 1206) and 3 DHAs (1213, 1218 and 1219) that represent the same regions. The second is that there are 3 Zones (1201, 1203 and 1204) that are each broken down into two distinct DHAs (1211, 1212, 1214, 1215, 1216 and 1217). The information at the Zone level was used to create the peer groups. At the final stage in the cluster analysis, the DHA level geography was added to the existing clusters. The DHAs did not have an impact on the placement of the other health regions into the final peer groups. In an analysis involving the peer groups, only one level of geography in Nova Scotia should be used.
4.3 Collapsing Small Clusters
The results from section 4.2 (specifically Table 4.2.1) represent clusters that are approximately evenly spaced and have minimal within-cluster variance given the parameters used by the clustering algorithm. The results in the table show that 12 clusters were formed that range in size from 2 to 35 health regions (excluding DHAs). However, having a cluster with fewer than five regions is not practical as it does not provide many options for comparison. In order to provide more peers for comparison, clusters with fewer than five members were combined with their nearest neighbour. The exception is cluster G (Montréal, Toronto and Vancouver). Cluster G was not combined with another cluster since these health regions tend to be very different from other regions across the country.
There were three clusters that were joined with their closest neighbour. Two of these clusters were joined together; Cluster F (health regions 2417, 2418 and 6201) was combined with its nearest neighbour cluster K (health regions 4685 and 4714). The collapsing of clusters F and K produced a cluster with five health regions, so no additional collapsing was required. This combined cluster was labelled Cluster F. As well, Cluster L (health regions 1014, 2409 and 2410) was combined with cluster H (health regions 3549, 4660, 4670, 4709, 4710, 5951, and 5952). This combined cluster was labelled Cluster H. The result of collapsing the smaller clusters was that the 12 peer groups produced from the final cluster analysis using the FASTCLUS procedure and presented in Table 4.2.1 were reduced to 10 groups. Summary statistics of the final 10 peer groups can be found in Appendix B (excluding the DHAs), a descriptive summary can be found in Appendix C and a list of health regions in each final peer group can be found in Appendix D.
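The collapsing rule can be sketched as a loop that merges each undersized cluster with its nearest neighbour. Several details here are assumptions rather than the documented procedure: distances are measured between cluster means, the merged cluster keeps the label of the larger cluster, exempt clusters are neither merged nor used as merge targets, and the one-dimensional toy values stand in for the real 24-variable cluster centers.

```python
import math


def centroid(members):
    """Mean of each coordinate across the cluster's observations."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def collapse_small(clusters, min_size=5, exempt=()):
    """Merge undersized clusters with their nearest neighbour until none remain."""
    clusters = {lab: list(pts) for lab, pts in clusters.items()}
    while True:
        small = [lab for lab, pts in clusters.items()
                 if len(pts) < min_size and lab not in exempt]
        if not small:
            return clusters
        # Process the smallest cluster first (a fixed but arbitrary order).
        src = min(small, key=lambda lab: (len(clusters[lab]), lab))
        targets = [lab for lab in clusters if lab != src and lab not in exempt]
        if not targets:
            return clusters
        centers = {lab: centroid(pts) for lab, pts in clusters.items()}
        dst = min(targets, key=lambda lab: math.dist(centers[src], centers[lab]))
        # Keep the label of the larger cluster (assumption).
        if len(clusters[dst]) >= len(clusters[src]):
            keep, drop = dst, src
        else:
            keep, drop = src, dst
        clusters[keep].extend(clusters.pop(drop))


# Hypothetical one-dimensional positions; cluster sizes mirror those in the text.
toy = {
    "F": [(10.0,), (10.5,), (11.0,)],
    "K": [(12.0,), (12.5,)],
    "G": [(50.0,), (51.0,), (52.0,)],
    "H": [(0.0,), (0.5,), (1.0,), (1.5,), (2.0,), (2.5,), (3.0,)],
    "L": [(4.0,), (4.5,), (5.0,)],
}
result = collapse_small(toy, min_size=5, exempt=("G",))
print({lab: len(pts) for lab, pts in sorted(result.items())})
# {'F': 5, 'G': 3, 'H': 10}
```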
5.1 Principal Components
Principal component analysis is a multivariate technique which aims to reduce the number of variables in the data to a few factors called principal components. Principal components are linear combinations of the original variables and are uncorrelated. They are derived in decreasing order of importance, so that as much as possible of the total variance in the data is explained by as few factors as possible. Therefore, the first principal component is the most important factor since it explains the largest proportion of the total variance in the data.
A principal component analysis was performed on the 24 socio-economic variables used in the cluster analysis. The first two principal components accounted for just over 53% of the total variability. The first principal component appears to measure urbanicity (housing affordability, proportion of immigrants, average dwelling value, proportion of post-secondary graduates, proportion of visible minorities, etc.). The second principal component seems to measure income inequality (median household income, proportion of low income children, proportion of low income persons in private households, proportion of all income that came from government transfers, etc.). This result is similar to the previous peer group classifications, which indicates that the variables which drive the analysis are remaining fairly consistent over time. The first six principal components accounted for over 86% of the total variability in the data, showing that 24 variables can be reduced to six factors without losing much information.
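The idea of a principal component explaining a share of the total variance can be illustrated with a small sketch. It uses power iteration on the covariance matrix, a simple numerical method substituted here for a full eigen-decomposition, so it recovers only the first component. The three-variable data set is hypothetical and already standardized.

```python
import math
import statistics


def covariance_matrix(rows):
    """Population covariance matrix of rows (observations x variables)."""
    n = len(rows)
    means = [statistics.fmean(col) for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    p = len(means)
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    p = len(cov)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    return vec, eigenvalue


# Hypothetical data: the first two variables move together, the third is unrelated.
rows = [(1.0, 1.0, 1.0), (-1.0, -1.0, 1.0), (1.0, 1.0, -1.0), (-1.0, -1.0, -1.0)]
cov = covariance_matrix(rows)
vec, eigenvalue = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = eigenvalue / total_variance
print(round(share, 3))  # 0.667
```

In this toy case the first component loads equally on the two correlated variables and accounts for two thirds of the total variance.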
5.2 Strongest Predictors
In order to determine which variables played a key role in defining the health region peer groups, the final clusters were run against all 24 variables in a stepwise discriminant analysis. Partial R² statistics for entry and removal were set at 0.15. Any variable which had an R² value of 0.5 or higher when regressed against a variable already in the model was removed from the analysis. A summary of the results is found in Table 5.2.1.
The strongest predictors of the final peer groups are population density and the percent of the population self-identifying as Aboriginal. No additional variables were removed from the analysis when regressed against population density, whereas the proportion of the population under the age of 20, the proportion of owner occupied dwellings, the proportion of lone-parent families and the proportion of households spending 30% or more of their income on shelter were removed from the analysis when regressed against the percent of population self-identifying as Aboriginal.
5.3 Peer Group Descriptions
The four key variables determined by the stepwise discriminant analysis were used to represent each of the clusters. The mean values of these four variables for each peer group can be found in Appendix B. For each of the four variables, several percentiles were calculated and used to classify the peer groups. Values were classified based on the following ranges.
Very High: X > 85th percentile
High: 65th percentile < X ≤ 85th percentile
Medium: 35th percentile < X ≤ 65th percentile
Low: 15th percentile < X ≤ 35th percentile
Very Low: X ≤ 15th percentile
The results from this classification can be found in Table 5.3.1. While the methodology is crude as a descriptive tool, it does help to distinguish one peer group's characteristics from another's. As shown in the table below, there are no two peer groups which fall into the same category for all four variables. For example, peer group G (which consists of Toronto, Vancouver and Montréal) is the only group with a very high population density, a very low percent of Aboriginals and a very high percent of immigrants.
The results from this classification were used to derive a written summary of the 10 peer groups based on the four key variables from the discriminant analysis. This summary is presented in Appendix C.
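The five percentile ranges used in this section can be expressed as a simple lookup. In this sketch the percentile method (`method="inclusive"` in Python's `statistics.quantiles`) and the input values are assumptions; the document does not state how the percentiles were computed.

```python
import statistics


def percentile_cuts(values):
    """Return the 15th, 35th, 65th and 85th percentiles of values."""
    q = statistics.quantiles(values, n=100, method="inclusive")
    return q[14], q[34], q[64], q[84]


def classify(x, cuts):
    """Map x to one of the five categories using the four cut points."""
    p15, p35, p65, p85 = cuts
    if x > p85:
        return "Very High"
    if x > p65:
        return "High"
    if x > p35:
        return "Medium"
    if x > p15:
        return "Low"
    return "Very Low"


# Hypothetical values 0, 1, ..., 100, so each percentile equals its own rank.
cuts = percentile_cuts(list(range(101)))
print([classify(v, cuts) for v in (10, 20, 50, 85, 90)])
# ['Very Low', 'Low', 'Medium', 'High', 'Very High']
```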
5.4 Geographic Limitations
Each province and territory defines the geographic boundaries for a health region based on administrative preference and these boundary definitions change over time. Health regions can be strictly urban or rural or some combination of the two. There may be considerable variability within health regions with regard to health measures due to the lack of geographic homogeneity and this should be taken into consideration when inferences are being made about a certain region. For instance, even though the health indicators in Vancouver compare favourably with the national averages, this should not be interpreted as meaning that the residents of the downtown core in Vancouver have better than average health (see note 2). This lack of homogeneity in defining health region boundaries makes the exercise of assigning health regions to peer groups much more difficult, as it can have a large impact on how well a certain variable represents the entire region and in some cases important defining factors may be missed.
It should also be noted that there may be considerable variability amongst the health regions within a peer group with regard to the socio-economic factors used in the cluster analysis. This should be considered when comparing regions within a certain peer group. This variability can be seen for the 2007 peer groups in Appendix B for the four key variables determined by the stepwise discriminant analysis.
5.5 Collapsing Health Regions
There are two instances where the CCHS combines smaller health regions to ensure that sample survey estimates will attain a sufficient coefficient of variation (CV) to be reportable. This occurs in northern Manitoba where health region 4680 (Burntwood Regional Health Authority) and 4690 (Churchill Regional Health Authority) are combined to form 4685 (Burntwood/Churchill). It also occurs in northern Saskatchewan where health region 4711 (Mamawetan Churchill River Regional Health Authority), 4712 (Keewatin Yatthé Regional Health Authority) and 4713 (Athabasca Health Authority) are combined to form 4714 (Mamawetan/Keewatin/Athabasca). The decision was made to use these combined health regions in the creation of the peer groups since the CCHS is one of the principal data sources used in an analysis of health related data by peer groups.
5.6 Geographic Representation of Final Peer Groups
The map below is a good visual representation of the geographic clustering of the health regions into the final 10 peer groups. Montréal, Toronto and Vancouver form the smallest cluster because in terms of the size and the diversity of their populations, they are too different from the other health regions to be combined with any other peer group.
There are some definite clusters of health regions that formed based on their location within Canada. The northern regions have clustered based on the Aboriginal make-up of their communities. A cluster of eastern health regions has formed based on their low employment rates and very low proportion of immigrants. All peer groups have representation across provincial and/or territorial borders.
6. Peer Groups in Action
The purpose of this section is to demonstrate the usefulness of the peer groups. There are two valuable, yet different, analyses possible with the peer groups: health-related indicators can be compared between and within peer groups. Since peer groups are formed based on regions that have similar socio-economic characteristics, it is expected that differences between peer groups will arise. Peer groups with better socio-economic status indicators are likely to have better health status measures. Estimates of a single peer group can also be compared with national averages in order to ascertain how well the group of regions fares as a whole. The second analysis possible, one that may be more relevant, is the comparison of health regions within a peer group. Once the effects of the various social and economic characteristics known to influence health status have been removed, a more useful comparison of regions by measures of health status is possible.
The example illustrated in Section 6.1 is a simple demonstration of how and when the peer groups can be used. The example uses the 2007 CCHS data and the 2003 Peer Group Classification, which is the variable available on the data file. Note that the 2003 Peer Group Classification resulted in nine peer groups. A similar analysis will be possible with the 2008 CCHS data and the 2007 Peer Group Classification once the data is released in June of 2009. A more detailed analysis involving the peer groups can be found in the paper 'The Health of Canada's Communities' written by Margot Shields and Stéphane Tremblay of Statistics Canada (2002).
6.1 Example: Heart Disease
This example focuses on the prevalence of heart disease in the population 18 years of age and over in the different regions across the country. Every CCHS respondent is asked if he/she has heart disease. The national rate of heart disease for the adult population is 5.1%. The missing rate for this health indicator is less than 1% and in this example the missing values have been coded as an absence of the disease.
The rate of heart disease in each peer group is shown in Table 6.1.1, along with the description of the peer group. The prevalence of heart disease in Peer Group B is 1.3 percentage points lower than the national average. It is also 3.7 percentage points lower than the rate of heart disease in Peer Group I. Both of these differences are significant at the 0.1% level. Peer Group B is composed of mainly urban centers that have a low percentage of government transfer income. This group has the lowest smoking rate (19.2%) and the highest physical activity rate (48.7%) of all nine peer groups. It also has a low heavy drinking rate (20.4%). On the other hand, Peer Group I is composed of mainly rural health regions that have a very high rate of government transfer income. This group has a high smoking rate (28.6%), a high heavy drinking rate (29.1%) and a low physical activity rate (42.2%). The differences in the rates of these three risk factors between Peer Groups B and I are all significant at the 1% level.
Note that when the CV associated with the estimate is between 16.6% and 33.3%, the estimate in the table has an 'E' beside it, which indicates general unrestricted release but is a warning cautioning users of the high sampling variability. When the CV associated with the estimate is above 33.3%, the estimate in the table is replaced by an 'F', which indicates that the estimate cannot be released. In this example, the estimate of heart disease in Peer Group F is associated with a high level of variability. This is partly due to the fact that the CCHS does not generally cover health regions 2417 (Nunavik) and 2418 (Terres-Cries-de-la-Baie-James). The exclusion of two health regions means that there are only three regions available for comparison in this peer group, and all of them are remote regions with resulting small sample sizes.
There are 14 health regions that make up Peer Group B. Table 6.1.2 shows the prevalence of heart disease in each of these regions. There are 13 regions that have a prevalence rate equal to or below the national rate of 5.1%. The highest prevalence of heart disease is 5.6% and occurs in health region 3530. This region has a high smoking rate (24.4%) and a low physical activity rate (48.6%). On the other hand, the lowest heart disease rate is 2.3% and occurs in health region 4823. This region has a low smoking rate (18.5%) and a high physical activity rate (55.0%). The difference in the prevalence of heart disease between these two regions is significant at the 1% level.
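The release rule described above can be sketched as a small helper. The function names are hypothetical, the treatment of a CV of exactly 16.6% as flagged is an assumption, and the estimates and CVs in the usage lines are made up for illustration.

```python
def release_flag(cv_percent: float) -> str:
    """Return '', 'E' or 'F' for a coefficient of variation given in percent."""
    if cv_percent > 33.3:
        return "F"  # estimate cannot be released
    if cv_percent >= 16.6:
        return "E"  # released with a caution about high sampling variability
    return ""       # released without a flag


def display(estimate: float, cv_percent: float) -> str:
    """Format an estimate (in percent) the way it would appear in a table cell."""
    flag = release_flag(cv_percent)
    if flag == "F":
        return "F"  # the estimate itself is replaced by the flag
    return f"{estimate:.1f}{flag}"


# Hypothetical estimates and CVs.
print(display(4.2, 8.0))   # 4.2
print(display(4.2, 20.0))  # 4.2E
print(display(4.2, 40.0))  # F
```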
When it is desired to perform an analysis of a rare event at the health region level, it is often the case that the associated CV does not allow for general release of the information. Table 6.1.2 shows that 13 of the 14 regions in Peer Group B have an estimate of heart disease that is associated with a high sampling variability (CV between 16.6% and 33.3%). The majority of the health regions belonging to this group have a moderately high population density. For other peer groups that contain more remote health regions, it may not be possible to conduct the same analysis due to corresponding CVs above 33.3%. Typically in these cases, the results are published at the province level in order to obtain more sample size and more reliable estimates. The peer groups offer an alternative to the provinces in these types of situations.
As a result of health region boundary changes as of January 2007, and the availability of 2006 Census data, it was necessary to update the 2003 Peer Group Classification. In keeping with the original working paper, the goal was to produce a classification which would cluster health regions with similar social and economic health determinants into peer groups. Twenty-four variables covering a wide range of social, economic and demographic areas were used to cluster the health regions.
Health regions were grouped using a non-hierarchical clustering algorithm. Starting with an initial set of 17 clusters, and ensuring that each cluster contained at least two health regions, the results indicated that the regions naturally grouped themselves into 12 distinct peer groups.
Peer groups with fewer than five health regions were combined with their nearest neighbour. This was done to provide enough health regions within a peer group for comparison purposes. The cluster containing Montréal, Toronto and Vancouver was not forced to join another cluster as these health regions tend to have more in common with each other than with other health regions. The final result was 10 peer groups ranging in size from 3 to 35 (not including the DHAs in Nova Scotia). When mapped, peer groups appeared to form based on their geography. Further, they also appeared to form based on their relative distance to urban centers.
Stepwise discriminant analysis was used to determine which variables had the most influence on the final peer groups. The four most important variables were population density, proportion of Aboriginals, employment rate and proportion of immigrants. Each peer group has at least one distinguishing factor in terms of these four variables.
Peer groups can be useful in an analysis of health-related indicators since once the effects of the various social and economic characteristics known to influence health status have been removed, a more useful comparison of regions is possible. Health indicators can be compared between and within peer groups. As well, the peer groups offer an alternative to the provinces when the results of an analysis cannot be presented at the health region level due to insufficient sample size or high sampling variability.
Anderberg, M.R., Cluster Analysis for Applications. New York: Academic Press, 1973.
Chatfield, C. and Collins, A.J., Introduction to Multivariate Analysis. London: Chapman and Hall Ltd., 1983.
Everitt, B.S., Cluster Analysis, 3rd Edition. Toronto: Halsted Press, 1993.
MacNabb, Larry. "Health Region Peer Groups." Health Indicators (Statistics Canada), May 2002, Catalogue Number 82-221-XIE.
McLeod, Logan. "Health Region Peer Groups 2003." Health Indicators (Statistics Canada), June 2004, Catalogue Number 82-221-XIE.
SAS Institute Inc., SAS OnlineDoc®, Version 9. Cary, NC: SAS Institute Inc., 2003.
Shields, M. and Tremblay, S. "The Health of Canada's Communities." Health Reports (Statistics Canada), 2002, Catalogue Number 82-003-XIE.
1. Note that the terms peer group and cluster are used interchangeably to refer to the classification of health regions into groups with similar socio-economic characteristics.
2. Shields, M and Tremblay, S. "The Health of Canada's Communities." Health Reports (Statistics Canada), 2002, Catalogue Number 82-003-X20021016330.