Data quality, concepts and methodology: Data quality
All data, from whatever source, are subject to error. The Industrial Water Survey is no exception. There are two general categories of error in surveys. The first is sampling error, which arises from the fact that a sample or subset of the target population is used to represent the population. The size of sampling error is quantifiable. The second category is referred to as non-sampling error and is not as easily quantified. Non-sampling error refers to all the other kinds of error that arise in surveys. Examples of non-sampling error include an incomplete or inaccurate list of the general population, misinterpretation of questions by respondents, provision of erroneous information, failure to respond, and information processing errors.
Typically, the sampling error is measured by the expected variability of the estimate from the true value, expressed as a percentage of the estimate. This measure is referred to as the coefficient of variation (CV), that is, the standard error of the estimate divided by the estimate itself. Coefficients of variation of the final estimates were computed for the Industrial Water Survey and are indicated on the statistical tables. The quality of the estimates was classified as follows:
As mentioned in the previous section on “data collection and processing”, every attempt was made to eliminate the non-sampling error through collection and data validation techniques.
The 2011 response rate for the manufacturing component of the Industrial Water Survey was 62%. For the mining component of the survey, it was 65%. The response rate was 90% for the thermal-electric component. The following variables were considered to be mandatory for the survey: the unit of measurement (C0101); total water intake (C1013); total water discharge (C1113); whether the establishment treated any intake water (C3001); whether the establishment recirculated or reused water (C5001); and whether the establishment measured its discharge water (C6001). If a questionnaire was missing any one of these variables, it was considered to be a “total non-response”. At the end of the survey estimation stage, the sample was re-weighted to account for the “total non-response” units.
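The total non-response rule and the re-weighting step can be illustrated with a short sketch. This is not the survey's production method: the record layout (a dictionary keyed by variable code plus a `weight` field) and the single adjustment cell are assumptions made for illustration. Only the six variable codes come from the text above.

```python
# Mandatory variable codes, as listed in the text above.
MANDATORY = ("C0101", "C1013", "C1113", "C3001", "C5001", "C6001")


def is_total_nonresponse(record):
    """Return True when any mandatory variable is absent or None.

    A reported value of 0 counts as a response, so the check tests for
    None rather than for falsiness.
    """
    return any(record.get(code) is None for code in MANDATORY)


def reweight(records):
    """Redistribute the weight of total non-response units over respondents.

    Assumption: one adjustment cell covering all records. A real survey
    would usually apply the adjustment within strata or similar groups.
    """
    respondents = [r for r in records if not is_total_nonresponse(r)]
    total_weight = sum(r["weight"] for r in records)
    respondent_weight = sum(r["weight"] for r in respondents)
    if respondent_weight == 0:
        raise ValueError("no respondents available to carry the weight")
    factor = total_weight / respondent_weight
    # Return adjusted copies so the input records are left unchanged.
    return [{**r, "weight": r["weight"] * factor} for r in respondents]
```

After the adjustment, the respondents' weights sum to the total weight of the original sample, so the non-responding units are still represented in the estimates.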
Many factors affect the accuracy of data produced in a survey. For example, respondents may have made errors in interpreting questions, answers may have been incorrectly entered on the questionnaires, and errors may have been introduced during the data capture or tabulation process. Every effort was made to reduce the occurrence of such errors in the survey.
Returned data were first checked using an automated edit-check program (BLAISE) immediately after capture. This first procedure verifies that all mandatory cells have been filled in, that certain values lie within acceptable ranges, that questionnaire flow patterns have been respected, and that totals equal the sum of their components. Collection officers evaluate the edit failures and concentrate follow-up efforts accordingly.
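The kinds of checks described above can be sketched in a few lines. This is an illustration only, not the BLAISE edit program: the function name, the rule formats and the tolerance are assumptions, and questionnaire flow checks are left out for brevity.

```python
def edit_failures(record, mandatory, ranges, totals, tolerance=1e-6):
    """Return a list of edit-failure messages for one captured record.

    mandatory: field names that must be filled in
    ranges:    {field: (low, high)} inclusive acceptable bounds
    totals:    {total_field: [component_field, ...]}
    """
    failures = []

    # 1. Mandatory cells must be filled in.
    for field in mandatory:
        if record.get(field) is None:
            failures.append(f"{field}: missing mandatory value")

    # 2. Reported values must lie within acceptable ranges.
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            failures.append(f"{field}: {value} outside [{low}, {high}]")

    # 3. Totals must equal the sum of their components.
    for total_field, parts in totals.items():
        total = record.get(total_field)
        values = [record.get(p) for p in parts]
        if total is None or any(v is None for v in values):
            continue  # the balance cannot be checked with missing values
        if abs(total - sum(values)) > tolerance:
            failures.append(
                f"{total_field}: {total} != sum of components {sum(values)}"
            )

    return failures
```

A collection officer could then sort records by the number or severity of failures to decide where follow-up is most needed.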
Further data checking is performed by subject matter officers, who compare historical data (if available) with returned data to determine whether differences between survey cycles are reasonable. If not, collection officers are asked to confirm the responses with respondents. Subject matter officers also research companies (annual reports, web sites, etc.) in an effort to verify information submitted by respondents.
Statistical imputation was used for partial-response records. Five methods of imputation were used for the Industrial Water Survey: deterministic imputation (only one possible value for the field to impute), historical imputation, imputation by ratio, donor imputation (using a "nearest neighbour" approach to find a valid record that is most similar to the record requiring imputation) and manual imputation. Ratios were calculated and donors were selected for imputation purposes based on the same or closest industry group within specified geographic areas.
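Two of these methods, imputation by ratio and nearest-neighbour donor imputation, can be sketched as follows. The field names, the distance measure (sum of absolute differences) and the helper names are assumptions for illustration. The sketch also assumes the donor pool has already been restricted to the same or closest industry group and geographic area.

```python
def ratio_estimate(donors, target, aux):
    """Ratio R = sum(target) / sum(aux) over complete donor records."""
    numerator = sum(d[target] for d in donors)
    denominator = sum(d[aux] for d in donors)
    if denominator == 0:
        raise ValueError("auxiliary total is zero; ratio is undefined")
    return numerator / denominator


def impute_by_ratio(recipient, donors, target, aux):
    """Impute recipient[target] as R * recipient[aux]."""
    return ratio_estimate(donors, target, aux) * recipient[aux]


def nearest_donor(recipient, donors, match_fields):
    """Return the donor closest to the recipient on the matching fields.

    Assumption: distance is the sum of absolute differences. In practice
    the matching fields would normally be scaled to comparable units.
    """
    if not donors:
        raise ValueError("donor pool is empty")
    return min(
        donors,
        key=lambda d: sum(abs(d[f] - recipient[f]) for f in match_fields),
    )


def impute_by_donor(recipient, donors, target, match_fields):
    """Copy the target value from the nearest donor."""
    return nearest_donor(recipient, donors, match_fields)[target]
```

Ratio imputation preserves the relationship between two variables within the group, while donor imputation always yields a value that was actually reported by a similar unit.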
The response values for sampled units were multiplied by a sampling weight to produce estimates for the entire population. The sampling weight was calculated using a number of factors, including the probability of the unit being selected in the sample. Finally, the weights were adjusted to account for the uncovered portion of the population and for respondents who could not be contacted or who refused to complete the survey.
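As a minimal sketch of weighted estimation, the example below takes the initial weight to be the inverse of the selection probability, which is a common convention and an assumption here, since the text only says the probability is one of several factors. The function names and the figures in the usage lines are made up for illustration.

```python
def design_weight(selection_probability):
    """Initial sampling weight, assumed to be 1 / selection probability."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability


def estimate_total(values, weights):
    """Weighted population total: the sum of weight * reported value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * y for y, w in zip(values, weights))


# Illustrative figures only: two sampled units with made-up intake values.
weights = [design_weight(0.5), design_weight(0.25)]  # 2.0 and 4.0
total_intake = estimate_total([10.0, 20.0], weights)  # 2*10 + 4*20 = 100.0
```

A unit selected with probability 0.25 stands in for four units in the population, which is why its reported value is counted four times in the total.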
Data evaluation and error detection during data collection is an important way to ensure that the final estimates are of good quality. However, the survey results are also evaluated after data collection is over and the estimates have been produced. One way to assess data quality is to compare the estimates with trends in other data collected. For the Industrial Water Survey, the 2011 estimates were compared with the estimates from the 2009, 2007 and 2005 editions. This historical comparison was made to ensure that the estimates were coherent.
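A historical comparison of this kind could be sketched as a check on cycle-to-cycle relative change. The 25% threshold and the function names are assumptions for illustration; the text does not state what criteria were used to judge coherence.

```python
def relative_changes(series):
    """Relative change between consecutive survey cycles.

    series: {year: estimate}
    returns {(previous_year, current_year): relative change}
    """
    years = sorted(series)
    changes = {}
    for prev, cur in zip(years, years[1:]):
        if series[prev] == 0:
            continue  # relative change is undefined from a zero base
        changes[(prev, cur)] = (series[cur] - series[prev]) / series[prev]
    return changes


def flag_breaks(series, threshold=0.25):
    """Return the cycle pairs whose absolute relative change exceeds threshold.

    The default threshold is a hypothetical value chosen for this sketch.
    """
    return [
        pair
        for pair, change in relative_changes(series).items()
        if abs(change) > threshold
    ]
```

Flagged pairs are candidates for review rather than proof of error, since a large change may reflect a real shift in the industry.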