# 5.0 Data accuracy and quality

## 5.1 Sampling errors

Sampling error arises from estimating a population characteristic from only a portion of the population rather than the entire population. It is the difference between the estimate derived from a sample survey and the 'true' value that would result if a census of the whole population were taken under the same conditions. There is no sampling error in a census because the calculations are based on the entire population.

## 5.2 Standard error, confidence intervals and coefficient of variation

A common measure of sampling error is the standard error (SE). The standard error measures the degree of variation introduced in estimates by selecting one particular sample rather than another of the same size and design. The standard error may also be used to calculate confidence intervals associated with an estimate (Y).

Confidence intervals (CI) express the precision of the estimate. It has been demonstrated mathematically that, if the sampling were repeated many times, the true population value would lie within the confidence interval Y +/- 2SE about 95 times out of 100, and within the narrower confidence interval Y +/- SE about 68 times out of 100.
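
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: it assumes simple random sampling from a hypothetical normal population and uses the sample mean as the estimate Y, which is far simpler than the SFS design. The function name `coverage` and all parameter values are assumptions.

```python
import random
import statistics

def coverage(true_mean=50_000.0, true_sd=10_000.0, n=400, replicates=2_000, seed=12345):
    """Share of replicate samples whose Y +/- SE and Y +/- 2SE intervals contain the true mean."""
    rng = random.Random(seed)
    hits_1se = hits_2se = 0
    for _ in range(replicates):
        # Draw one simple random sample from the hypothetical population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        y = statistics.fmean(sample)                # the estimate Y
        se = statistics.stdev(sample) / n ** 0.5    # estimated standard error of Y
        if abs(y - true_mean) <= se:
            hits_1se += 1
        if abs(y - true_mean) <= 2 * se:
            hits_2se += 1
    return hits_1se / replicates, hits_2se / replicates

share_1se, share_2se = coverage()
print(f"true value within Y +/- SE:  {share_1se:.1%}")
print(f"true value within Y +/- 2SE: {share_2se:.1%}")
```

Under these assumptions the two shares should come out close to 68% and 95% respectively, with small run-to-run differences that depend on the seed.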

Another important measure of sampling error is given by the coefficient of variation (CV). The coefficient of variation is the standard error of an estimate, expressed as a ratio or percentage of the estimate (i.e. 100 x SE / Y).

To illustrate the relationship between the standard error, the confidence intervals and the coefficient of variation, let us take the following example. Suppose that the estimated median net worth from a given source is \$10,000, and that its corresponding standard error is \$200. The coefficient of variation is therefore equal to 2%. The 95% confidence interval estimated from this sample ranges from \$9,600 to \$10,400, i.e. \$10,000 +/- \$400. This means that with a 95% degree of confidence, it can be asserted that the median net worth of the target population is between \$9,600 and \$10,400.
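
The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch; the function names are illustrative and are not part of any Statistics Canada tool.

```python
def coefficient_of_variation(estimate, standard_error):
    """CV expressed as a percentage of the estimate: 100 x SE / Y."""
    return 100.0 * standard_error / estimate

def confidence_interval(estimate, standard_error, multiplier=2):
    """Y +/- multiplier x SE; multiplier=2 gives the approximate 95% interval."""
    margin = multiplier * standard_error
    return estimate - margin, estimate + margin

estimate = 10_000       # estimated median net worth ($), from the example above
standard_error = 200    # its standard error ($)

cv = coefficient_of_variation(estimate, standard_error)
low, high = confidence_interval(estimate, standard_error)
print(f"CV = {cv:.1f}%")                   # CV = 2.0%
print(f"95% CI = ${low:,} to ${high:,}")   # 95% CI = $9,600 to $10,400
```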

Estimates with a coefficient of variation less than 16.6% are considered reliable for general use. Estimates with coefficients of variation between 16.6% and 33.3% should be accompanied by a warning to caution users about the high levels of error. Estimates with coefficients of variation higher than 33.3% are deemed to be unreliable. For estimates of net worth in this survey, CVs greater than 33.3% generally occur when the sample size contributing to an estimate is 25 or fewer. This affects the level of detail in published tables and, in particular, limits the availability of provincial statistics.
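
These thresholds amount to a simple classification rule. The sketch below encodes the cut-offs stated in the paragraph above; the function name and the short labels it returns are illustrative, not official wording.

```python
def quality_level(cv_percent):
    """Classify an estimate by its coefficient of variation (in percent)."""
    if cv_percent < 16.6:
        return "reliable for general use"
    if cv_percent <= 33.3:
        return "use with caution"
    return "unreliable"

for cv in (2.0, 20.0, 40.0):
    print(f"CV {cv:>4.1f}% -> {quality_level(cv)}")
```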

Table 5-1 provides quality level guidelines used at Statistics Canada.

Table 5-2 shows the precision of the SFS estimates. At the Canada level, the estimates are generally reliable. However, users should exercise caution when producing detailed estimates at the regional level. Additional variance estimates can be calculated by Statistics Canada on a cost-recovery basis.

The bootstrap approach, a pseudo-replication technique, is used to calculate the coefficients of variation of the estimates presented in Table 5-2. Many Statistics Canada surveys use complex sampling designs when selecting their samples. As variance estimation for these sampling schemes cannot be accomplished using simple formulae, approximate methods must be used to estimate variances. Resampling methods, and in particular the bootstrap method, figure among these. The bootstrap approach possesses many interesting properties and is the method employed by many Statistics Canada surveys.
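
As a rough illustration of how replicate-based variance estimation works, the sketch below computes a CV from a set of bootstrap replicate weights. It is not the SFS procedure: the helper names, the weighted-mean estimator, the generated data, and the naive with-replacement resampling used to build the replicate weights are all assumptions made for illustration. A bootstrap for a complex design would typically resample within the design's strata and clusters and rescale the weights.

```python
import random

def weighted_mean(values, weights):
    """Weighted mean of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def naive_bootstrap_weights(weights, replicates, rng):
    """Resample units with replacement; replicate weight = base weight x times drawn.

    Illustrative only: ignores strata, clusters and weight rescaling.
    """
    n = len(weights)
    out = []
    for _ in range(replicates):
        counts = [0] * n
        for _ in range(n):
            counts[rng.randrange(n)] += 1
        out.append([w * c for w, c in zip(weights, counts)])
    return out

def bootstrap_cv(values, weights, replicate_weights):
    """CV (%) from the spread of replicate estimates around the full-sample estimate."""
    estimate = weighted_mean(values, weights)
    squared_devs = [(weighted_mean(values, rw) - estimate) ** 2 for rw in replicate_weights]
    variance = sum(squared_devs) / len(squared_devs)
    return 100.0 * variance ** 0.5 / estimate

rng = random.Random(2005)
values = [rng.lognormvariate(11.0, 1.0) for _ in range(200)]   # hypothetical net worth values
weights = [rng.uniform(50.0, 150.0) for _ in range(200)]       # hypothetical survey weights
replicates = naive_bootstrap_weights(weights, 500, rng)
print(f"bootstrap CV = {bootstrap_cv(values, weights, replicates):.1f}%")
```

The key design point is that each replicate re-estimates the same quantity with a different set of weights, and the variance is taken from how much those replicate estimates move around the full-sample estimate.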

For more information on the bootstrap approach, refer to the Statistics Canada publication (Catalogue 12-002-XIE), The Research Data Centres Information and Technical Bulletin, Fall 2004, vol. 1 no. 2.

## 5.3 Non-sampling errors

Non-sampling errors can be defined as errors arising during the course of all survey activities other than sampling. Unlike sampling errors, they can be present in both sample surveys and censuses.

Non-sampling errors can be classified into two groups: random errors and systematic errors.

- Random errors are the unpredictable errors resulting from estimation. They generally cancel out if a large enough sample is used. However, when these errors do take effect, they often lead to an increased variability in the characteristic of interest (i.e., the greater the difference between the population units, the larger the sample size required to achieve a specific level of reliability).

- Systematic errors are those errors that tend to accumulate over the entire sample. For example, an error in the questionnaire design could cause problems with the respondent's answers, which, in turn, can create processing errors, etc. These types of errors often lead to a bias in the final results.

Non-sampling errors are extremely difficult, if not impossible, to measure. Since random errors have the tendency to be cancelled out, systematic errors are the principal cause for concern. Unlike sampling variance, bias caused by systematic errors cannot be reduced by increasing the sample size.
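
The contrast between random and systematic errors can be seen in a small simulation. This is a sketch under simplifying assumptions: each reported value is modelled as a hypothetical true value plus a fixed systematic bias plus random noise, and the function name and all numbers are illustrative.

```python
import random
import statistics

def simulate_mean(n, bias, rng, true_value=100.0, noise_sd=20.0):
    """Mean of n reports, each equal to true value + systematic bias + random noise."""
    return statistics.fmean(true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n))

rng = random.Random(2005)
for n in (100, 10_000, 100_000):
    est = simulate_mean(n, bias=-5.0, rng=rng)
    print(f"n={n:>7}: estimate={est:7.2f}, error={est - 100.0:+.2f}")
```

As n grows, the random noise averages out and the error settles near the bias of -5 rather than near zero.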

Non-sampling errors can occur because of problems in coverage, response, non-response, data processing, estimation and analysis.

### 5.3.1 Coverage errors

An error in coverage occurs when there is an omission, duplication or wrongful inclusion of the units in the population or sample. Omissions are referred to as undercoverage, while duplication and wrongful inclusions are called overcoverage. These errors are caused by defects in the survey frame: inaccuracy, incompleteness, duplication, inadequacy and obsolescence. Coverage errors may also occur in field procedures (e.g., a survey is conducted, but the interviewer misses several households or persons).

### 5.3.2 Response errors

Response errors result from data that have been requested, provided, received or recorded incorrectly. The response errors may occur because of inefficiencies with the questionnaire, the interviewer, the respondent or the survey process.

#### Poor questionnaire design

It is essential that sample survey or census questions are worded carefully in order to avoid introducing bias. If questions are misleading or confusing, then the responses may end up being distorted.

#### Interview bias

An interviewer can influence how a respondent answers the survey questions. This may occur when the interviewer is too friendly or aloof or prompts the respondent. To prevent this, interviewers must be trained to remain neutral throughout the interview. They must also pay close attention to the way they ask each question. If an interviewer changes the way a question is worded, it may impact the respondent's answer.

#### Respondent errors

Respondents can also provide incorrect answers. Faulty recollections, tendencies to exaggerate or underplay events, and inclinations to give answers that appear more 'socially desirable' are several reasons why a respondent may provide a false answer.

#### Problems with the survey process

Errors can also occur because of a problem with the actual survey process. Using proxy responses (taking answers from someone other than the respondent) or lacking control over the survey procedures are just a few ways of increasing the possibility for response errors.

### 5.3.3 Non-response errors

Non-response errors are the result of not having obtained sufficient answers to survey questions. There are two types of non-response errors: complete and partial. The overall response rate for the 2005 Survey of Financial Security was 67.7%.

#### Complete non-response errors

These errors can occur when the survey fails to measure some of the units in the selected sample. Reasons for this type of error may be that the respondent is unavailable or temporarily absent, the respondent is unable or refuses to participate in the survey, or the dwelling is vacant. If a significant number of people do not respond to a survey, then the results may be biased since the characteristics of the non-respondents may differ from those who have participated.

#### Partial non-response errors

This type of error deals with incomplete information obtained from the respondent. For certain people, some questions may be difficult to understand. To reduce this form of bias, care should be taken in designing and testing questionnaires. Appropriate edit and imputation strategies will also help minimize this bias.

### 5.3.4 Processing errors

Processing errors sometimes emerge during the preparation of the final data files. For example, errors can occur while data are being coded, captured, edited or imputed. Coder bias is usually a result of poor training or incomplete instructions, variation in coder performance (e.g., tiredness, illness), data entry errors, or machine malfunction (some processing errors are caused by errors in the computer programs). The same can be said of data capture errors. Sometimes, errors are incorrectly identified during the editing phase. Even when errors are discovered, they can be corrected improperly because of poor imputation procedures. To minimize errors, diagnostic tests are carried out periodically to ensure that expected results have been obtained.

### 5.3.5 Estimation errors

Statistics Canada and other data-collecting agencies devote much effort to designing and monitoring surveys in order to make them as error-free as possible. However, if an inappropriate estimation method is used, bias can still be introduced, no matter how error-free the survey was before estimation.

### 5.3.6 Analysis errors

Analysis errors include any errors that occur when using the wrong analytical tools or when the preliminary results are used instead of the final ones. Errors that occur during the publication of these data results are also considered analysis errors.

## 5.4 Treatment of large values

For any sample, estimates can be affected disproportionately by the presence or absence of extreme values from the population. In an asset and debt survey, a few extreme values are expected in the sample, as valid extreme values do exist in the population. Values outside defined bounds were identified and reviewed in relation to other information reported for that respondent. If the value was judged to be the result of a reporting or processing error, it was adjusted. Otherwise, it was retained.
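
The identification step described above can be sketched as a simple bounds check. The bounds, the field name `net_worth` and the records below are hypothetical; the actual bounds used for the SFS are not given in this section. The function only flags records for review and does not adjust or drop anything, since the decision to adjust or retain a value was made case by case.

```python
def flag_for_review(records, lower, upper, field="net_worth"):
    """Return the records whose value lies outside [lower, upper]; nothing is modified."""
    return [r for r in records if not (lower <= r[field] <= upper)]

# Hypothetical respondent records.
records = [
    {"id": 1, "net_worth": 85_000},
    {"id": 2, "net_worth": 12_500_000},
    {"id": 3, "net_worth": -900_000},
    {"id": 4, "net_worth": 240_000},
]

flagged = flag_for_review(records, lower=-500_000, upper=5_000_000)
print([r["id"] for r in flagged])  # [2, 3]
```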

## 5.5 Impact of sampling and non-sampling errors on SFS estimates

Due to the combined effect of these errors, the quality of net worth data is judged to be lower than the quality of income data. This is largely because records of the current value of assets and the outstanding amount of debt are not as readily available as records of income. For example, respondents with numerous bank accounts and investments may receive several different statements, with different reference periods. Compiling this information can be difficult; most income information, on the other hand, would be available in one document, if the respondent had completed an income tax return for the year in question.

## 5.6 Comparability of data and related sources

It is important to realize that there are no other sources for much of the data collected by SFS. For the variables that do have other sources, comparison is often difficult because of differences in how concepts are defined, how items are grouped, and how these items are valued.

Direct comparisons with outside sources, such as the National Balance Sheet Accounts (NBSA) of the System of National Accounts (SNA), do yield certain differences. Comparing the two sources is difficult due to definitional, coverage and treatment differences.

Based on rough comparisons between the NBSA and the SFS, the following general conclusions can be drawn:

1. The SFS appears to underestimate some net worth components, particularly financial assets and consumer debt.
2. The quality of estimates of real assets (e.g., owner-occupied homes, vehicles) is much better than that of financial assets.

In theory, given similar valuation procedures and groupings, SNA data should be the same as those collected by an asset and debt survey. The SNA collects individual wealth data from institutional sources such as banks and insurance companies, net of corporations and governments. One major problem has been the SNA categorization of individuals and unincorporated businesses. Because the data for individuals and for unincorporated businesses cannot be separated, the SNA estimates will always be higher than the survey estimates alone.

The Census and other surveys are important sources for ensuring that the SFS sample is representative of the Canadian population. Despite conceptual differences with the SNA estimates, ensuring a representative sample is extremely important to the validity of the data. It was determined that, with respect to characteristics such as sex, age, marital status and education, the 2005 SFS data were very comparable to data from the 2001 Census. SFS estimates for pension variables such as membership and contributions were found to be very close to data produced by Statistics Canada’s Pension Plans in Canada Survey.

## 5.7 Response rates

The overall response rate for the 2005 Survey of Financial Security was 67.7%. Table 5-3 gives a breakdown of response rates by province for the area sample and the high income sample.