Support Activities


For more information on the program as a whole, contact:
Mike Hidiroglou (613 951-0251, mike.hidiroglou@statcan.gc.ca).

Time Series

The objectives of the time series research are to maintain high-level expertise and offer consultation in the area, to develop and maintain tools that apply solutions to real-life time series problems, and to explore current problems without known or acceptable solutions.

The projects can be split into various sub-topics with emphasis on the following:

  • Consultation in Time Series (including course development);
  • Time Series Processing and Seasonal Adjustment;
  • Support to G-Series (Benchmarking and Reconciliation);
  • Calendarization;
  • Modeling and Forecasting;
  • Trend-Cycle Estimation.

Progress:

Consultation

As part of the Time Series Research and Analysis Centre (TSRAC) mandate, consultation was offered as requested by various clients. Topics most frequently covered in the reviewed period were related to benchmarking, modeling, forecasting and the use of seasonal adjustment tools for the purpose of data validation and quality assurance.

TSRAC members continued their participation in various analytical and dissemination groups such as the Forum for The Daily analysts and the new Forum on seasonal adjustment and economic signals. TSRAC members also met with various international visitors to discuss time series issues and refereed papers for external journals.

Time Series Processing and Seasonal Adjustment

This project monitors high-level activities related to the support and development of the Time Series Processing System. Seasonal adjustment is done using X-12-ARIMA (for analysis, development or production) or SAS Proc X12 (for production).

The Time Series Processing System was updated to include more diagnostics on the risks of seasonal outliers and extreme values. A special workshop on time series outliers was presented (Matthews 2014).

A protocol for intervention in seasonal adjustment was developed (Fortier and Matthews 2015).

Seasonal adjustment courses were updated to incorporate more exercises and were tailored to a specific group of users interested in seasonal indicator series for calendarization purposes.

Support to G-Series (Benchmarking and Reconciliation)

This project entails the support and development of G-Series 1.04.001, which includes PROC BENCHMARKING and PROC TSRAKING, two SAS procedures. A new balancing functionality is in development for G-Series 2.01.001. The balancing procedure will satisfy generalized linear constraints and non-negativity constraints. It will allow the processing of more complex reconciliation rules, including the treatment of processing groups related to temporal constraints.

Specifications for the new procedure were written (Bérubé, Ferland and Fortier 2014). A prototype module using SAS/OR (proc OPTMODEL) was developed. The release of the final solution is scheduled for next year.
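To illustrate the kind of problem the balancing procedure solves, the following sketch (illustrative Python only, not the SAS/OR prototype or the G-Series procedure; all data and names are hypothetical) reconciles preliminary component values by minimizing the adjustments subject to a linear accounting constraint and non-negativity.

```python
# A minimal sketch of data reconciliation as a constrained least-squares
# problem: adjust preliminary values as little as possible so that they
# satisfy a linear accounting constraint and remain non-negative.
import numpy as np
from scipy.optimize import minimize

x0 = np.array([120.0, 45.0, 30.0, 200.0])   # preliminary component values (hypothetical)
A = np.array([[1.0, 1.0, 1.0, -1.0]])        # constraint: c1 + c2 + c3 = total
b = np.array([0.0])

def objective(x):
    # minimize the sum of squared adjustments to the preliminary values
    return np.sum((x - x0) ** 2)

res = minimize(
    objective,
    x0,
    constraints=[{"type": "eq", "fun": lambda x: A @ x - b}],
    bounds=[(0.0, None)] * len(x0),          # non-negativity constraints
    method="SLSQP",
)
print("reconciled values:", np.round(res.x, 3))
```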

Calendarization

Calendarization is the process used to convert data reported over varying time intervals into values that cover calendar periods such as reference months and reference years. It can be performed using benchmarking methods, spline interpolation techniques or in the context of state space modeling (Quenneville, Picard and Fortier 2013). In recent months, the links between the various methods (including the relationship between the Hodrick-Prescott filter, the spline and the state space models) were further established (Picard 2014).
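The spline-based approach can be illustrated with a minimal sketch: cumulate the reported values at the end of each reporting interval, interpolate the cumulative series with a spline, and difference it at calendar-month boundaries. This is an illustration under simplifying assumptions, not the methods of Quenneville, Picard and Fortier (2013); the data and day indices are hypothetical.

```python
# A minimal sketch of spline-based calendarization: reported values covering
# irregular intervals are cumulated, the cumulative series is interpolated
# with a monotone spline, and calendar-month values are obtained by
# differencing the spline at month boundaries.
import numpy as np
from scipy.interpolate import PchipInterpolator

# Reported totals covering irregular intervals, identified by the day on
# which each reporting interval ends (day 0 is the start of the year).
report_end_day = np.array([0, 35, 63, 98, 120])      # interval end points
report_value = np.array([70.0, 58.0, 66.0, 40.0])    # value for each interval

cumulative = np.concatenate([[0.0], np.cumsum(report_value)])
spline = PchipInterpolator(report_end_day, cumulative)  # monotone fit of the cumulative series

month_end_day = np.array([0, 31, 59, 90, 120])        # January to April boundaries
monthly = np.diff(spline(month_end_day))
print("calendarized monthly values:", np.round(monthly, 2))
```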

Modelling and Forecasting

The new software SAS Forecast Studio (SAS/HPF) was received, installed and tested. The properties of the software were documented (Zhang and Picard 2015). Accessing input and output files in this new environment is not very intuitive; after some trial and error, the process was understood and a how-to guide was written to ease the learning curve for future users. Quality indicators were developed to help the analyst interpret the results of forecasting exercises in the context of identifying breaks in time series (Oyarzun, Matthews and Picard 2014).

Trend-Cycle Estimation

A new project was started to review the possibility of reintroducing trend lines in our official publications (especially in The Daily, Statistics Canada's official release bulletin). In the previous periods, various methods were reviewed, first in the literature and then in a simulation study. The results, which favour the use of a trend-cycle line in our published graphs based on a variant of the Dagum and Luati (2009) cascade linear filter, were presented to various internal committees. An implementation strategy and a communication plan for dissemination were developed. The communication plan includes the release of a frequently asked questions document for trend-cycle estimation (Fortier and Matthews 2014).

For further information, contact:
Susie Fortier (613-220-1948, susie.fortier@statcan.gc.ca).

References

Dagum, E.B., and Luati, A. (2009). A cascade linear filter to reduce revisions and false turning points for real time trend-cycle estimation. Econometric Reviews, vol. 28, issue 1-3, 40-59.

Quenneville, B., Picard, F. and Fortier, S. (2013). Calendarisation with interpolating splines and state space models. Journal of the Royal Statistical Society (Series C - Applied Statistics), vol. 62, part 3.

Record Linkage Resource Centre (RLRC)

The objectives of the Record Linkage Resource Centre (RLRC) are to provide consulting services to both internal and external users of record linkage methods, including recommendations on software and methodology and collaborative work on record linkage applications. Our mandate is to evaluate alternative record linkage methods and software packages for record linkage and, where necessary, develop prototype versions of software incorporating methods not available in existing packages. We also assist in the dissemination of information concerning record linkage methods, software and applications to interested persons both within and outside Statistics Canada.

Progress:

We continued to provide support work to the development team of G-Link, including participation in the Methodology-System Engineering Division (SED) Record Linkage Working Group meetings. The RLRC team met with SED bi-weekly and tracked the minutes, which serve as a potential source of information on current and past fixes, bugs and improvements to G-Link. RLRC also provided internal and external G-Link users with support when help, comments or suggestions regarding G-Link were submitted to G-Link_info through JIRA tickets.

RLRC provided a fix for the WINLKER SAS macro for G-Link and revised the non-linked pair creation algorithm of G-Link to gain run-time efficiency and enable very large linkages by avoiding hash-based processing. Our record linkages with health data helped us document performance and batch issues for management and developers. We worked on Social Data Linkage Environment (SDLE) linkages and used them as an opportunity to field-test new G-Link features and to develop more systematic and theoretically coherent approaches to defining and adjusting record linkages under SAS Grid.

We tested the code sent by the U.S. Census Bureau (Bill Winkler) for the standardisation of individual and business names. We briefly reviewed the module for the standardisation of business names developed by the private company WANTED. This work will be completed by Business Survey Methods Division (BSMD).

Several manuscripts were read, and discussions took place during the period with Australian researchers on the development of the E-M algorithm under certain forms of dependence, as well as on the applied survey approach to probabilistic linkage.

RLRC will prepare the G-Link 3.2 tutorial before the release of G-Link 3.2.

The list of record linkages carried out in the Methodology Branch was updated in 2014 and the results presented.

For further information, contact:
Abdelnasser Saïdi (613-863-7863, abdelnasser.saidi@statcan.gc.ca).

Data Analysis Resource Centre (DARC) Support

The Data Analysis Resource Centre (DARC) is a centre of data analysis expertise within Statistics Canada whose purpose is to encourage, suggest and provide good analytical methods and tools for use with data of all types. DARC provides services to Statistics Canada employees (analysts and methodologists) as well as to analysts from Research Data Centres (RDC consultations are not funded through 3,504 and therefore not discussed in this report) and on occasion to researchers and external clients. The Centre investigates methodological problems that have been identified by analysts, methodologists or external clients. The Centre is also involved in the transfer and exchange of knowledge and experience through reviewing and publishing technical papers and by giving seminars, presentations and courses.

The Centre has been active in numerous consultations with internal analysts, external analysts and methodologists. Internally, we have consulted with staff from the Social and Aboriginal Studies Division, Health Statistics Division, Social Analysis and Modelling Division, Economic Analysis Division, Health Analysis Division, Investment, Science and Technology Division, Manufacturing and Wholesale Trade Division, Retail and Service Industries Division, Centre for Special Business Projects, and Demography Division. These consultations covered topics including the use of weights within a linked cohort study, how linkage affects analysis, how to do an impact assessment study, how to deal with missing data in analysis, econometric modelling, graphing questionnaires and big data analysis.

In addition to internal analytical consultations, we have had external consultations with analysts at the Canada Revenue Agency, the Telfer School of Management (University of Ottawa), United Way Toronto and researchers at Queen's University. These consultations concerned comparing estimators, understanding design-based sampling, using weights in pair-wise estimation and latent variable analysis.

Additionally, our client base has continued to include methodologists. We have had consultations with the International Co-operation Division, Business Survey Methods, Household Survey Methods and Social Survey Methods. These consultations included questions on propensity score analysis for mode effect studies, the design and analysis of mode effect studies, the bootstrap, latent models, confidence intervals, power and sample size, understanding effect size, principal components, factor analysis, outlier detection, age-standardization and multiple comparisons.

In addition, we reviewed and provided written comments on a technical paper for the Statistical Research Innovation Division; the paper was on issues with confidence intervals (CIs) for rare events.

Two half-day courses were presented as part of the new Data Interpretation Workshop. A course on collection methods for methodologists was developed and delivered. Several new communication strategies, such as instructional videos for RDC clients, have been explored; these may become a useful tool for addressing frequently asked questions from all staff. A note for users on Marginal Effects was written.

For further information, contact:
Karla Fox (613-851-8556, karla.fox@statcan.gc.ca).

Generalized Systems

The Generalized Systems Teams and Resource Centre continued to develop functionality in G-SAM, G-EST, BOGS and G-CONFID for the Integrated Business Statistics Program (IBSP). The support centres provided ongoing support to users within Methodology, in the rest of Statistics Canada, and to external clients. The Banff support team worked closely with the methodologists supporting the IBSP to develop an appropriate edit and imputation strategy. The G-SAM support team also worked closely with IBSP methodologists to ensure that the sample selection process was thoroughly tested and that it ran smoothly in production. The G-Est team worked in conjunction with methodologists in Business Survey Methods Division (BSMD) to address performance issues in IBSP. The G-Confid support team gathered requirements for new development in the disclosure control software.

G-CONFID – Disclosure Control System

The G-CONFID software is a set of computer programs used to control the disclosure of confidential information by suppressing cells in quantitative data tables tabulated in several dimensions. The software was developed by Statistics Canada and has been available since 2009.

Although the methodology is based on the former CONFID (or GSAPS) system developed in the 1980s, several improvements have been made. The G-CONFID system contains one procedure (Proc Sensitivity) and two critical SAS macros (Suppress and Audit) that correspond to the main CONFID programs. This version offers more flexibility, and performance has been greatly improved. To meet the expanding needs of users, G-CONFID is under continuous development.
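For readers unfamiliar with tabular disclosure control, the sketch below shows a common primary-suppression test, the p% rule, purely as an illustration of cell sensitivity; it is not a description of the rules or code implemented in Proc Sensitivity.

```python
# Illustrative sketch only: the p% rule, a common primary-suppression rule in
# tabular disclosure control.  It illustrates the concept of a sensitive cell;
# it does not describe G-Confid's Proc Sensitivity.
def is_sensitive(contributions, p=10.0):
    """A cell is sensitive if the remainder (cell total minus the two largest
    contributions) is smaller than p% of the largest contribution, i.e. the
    second-largest contributor could estimate the largest within p%."""
    if len(contributions) < 2:
        return True
    c = sorted(contributions, reverse=True)
    remainder = sum(c) - c[0] - c[1]
    return remainder < (p / 100.0) * c[0]

print(is_sensitive([900.0, 60.0, 20.0, 15.0]))     # True: dominated by one contributor
print(is_sensitive([300.0, 280.0, 250.0, 240.0]))  # False: contributions well spread out
```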

Progress:

Consultation and Support

Generalized Systems (GenSys) assisted the Environment, Energy and Transportation Statistics Division (EETSD) in obtaining suppression patterns for the Refined Petroleum Products and Coke surveys. GenSys provided SAS programming assistance to implement solutions involving carrying forward existing patterns, treating cells revised to a value of zero, and identifying spurious sensitive aggregates that result from using absolute values, among others.

GenSys advised the methodologist of the Information Technology Operations Division (ITOD) exports survey on using G-Confid to check the confidentiality of data involving “Top 10” lists, dependent tables and multidimensional tables.

GenSys provided replies to RXP Services Limited of Australia in response to a series of inquiries concerning the abilities and limitations of G-Confid.

GenSys also provided support with G-Confid to:

  • Manufacturing and Wholesale Trade Division (MWTD) in the analysis of the Sawmills survey.
  • Retail and Service Industries Division (RSID) in the analysis of the Travel Arrangements Survey.
  • Investment, Science and Technology Division (ISTD) in the analysis of the Capital Expenditures Survey.
  • Economic Analysis Division (EAD) to vet the output generated on behalf of external analysts using Canadian Centre for Data Development and Economic Research (CDER) data.
  • Centre for Special Business Projects (CSBP) in support of a number of special business surveys including the New Brunswick Wage Rate Survey.
  • BSMD methodologists in support of various business surveys including the Census of Agriculture.

Development and Evaluation

Development of G-Confid was largely restricted to the initiatives of Investment Plan (IP) 6-67000-4039. In the course of offering user support, GenSys identified potential future improvements to G-Confid. GenSys also identified to System Engineering Division (SED) a minor correction to the SAS code of the macro %Suppress, unrelated to the methodology.

GenSys evaluated the functionality of G-Confid with (i) SAS 9.4 and (ii) an updated version of the solver used with the macro %Suppress.

GenSys met monthly with SED representatives to discuss the progress of development, support issues, and also matters affecting client service delivery, e.g., changes to the functional mailbox and web pages.

Documentation and Presentations

GenSys completed and arranged for translations of the functional descriptions describing the Proc Sensitivity and %Suppress modules. Work is ongoing to prepare the functional description.

GenSys presented an overview of confidentiality practices at Statistics Canada to representatives at the Federal-Provincial-Territorial Committee on Mineral Statistics hosted by NRCan.

GenSys updated the PowerPoint slideshow of an overview of G-Confid. BSMD presented the slideshow to representatives of the Bank of Canada. GenSys presented the slideshow to representatives of the Canada Revenue Agency, and included the results of a demonstration using fictitious data.

Members of GenSys and Household Survey Methods Division (HSMD) met with the representatives of Citizenship and Immigration Canada to discuss confidentiality methods and practices at Statistics Canada.

Investment Plan 6-67000-4039

The business objective is to enhance the generalized disclosure control software G-Confid in reaction to corporate risks, especially with respect to oversuppression, coherence, flexibility and performance. New algorithms, methods and modules will be offered to users.

G-Confid oversuppression, coherence, flexibility and performance issues will be addressed by the following enhancements:

  • a new algorithm to improve the efficiency of the macro "audit";
  • an additive controlled rounding module to protect non-financial numerical data;
  • a new algorithm, similar to the HiTaS method used in the Tau-Argus software, to improve the efficiency of the macro "suppress";
  • a new macro to automate the non-suppression of businesses that have waived the requirement to protect their data;
  • a new macro to process sample weights, for the benefit of non-suppression;
  • a new macro to automate the protection of data having negative values.

Progress:

GenSys prepared a production schedule of research, development and testing of each of the initiatives identified within the IP over the three year period. The schedule was received and approved by the GenSys steering committee.

GenSys advised SED on the implementation of the Shuttle algorithm, and tested the redesigned %Audit module. This initiative was implemented in the release of G-Confid version 1.05.

GenSys researched methods of the additive controlled rounding of tabular data, prepared specifications, delivered a prototype program in SAS to SED, and advised SED on metadata outputs as well as on the content of the User Guide. This initiative was implemented in the release of G-Confid version 1.05.

GenSys investigated the methods to process full waivers as well as partial waivers. GenSys then prepared and delivered the specifications to SED. Note that only full waivers fall within the scope of the IP and so SED will only develop the functionality to process full waivers.

Work continued to identify the best approaches for G-Confid to process microdata that include negative values. GenSys reviewed the feedback of a discussion document produced in the previous fiscal year, identified the most promising approaches and examined them more thoroughly. GenSys identified the theoretical soundness of the use of absolute values. GenSys prepared a progress report on the work over the fiscal year.

Work continued on the separate Methodology research proposal to establish methods for G-Confid to process complex data with huge numbers of linear constraints. The work is ongoing.

GenSys representatives attended meetings of the IP implementation committee, reported on progress of each initiative, identified potential threats to success and responded to members’ questions. Additionally, GenSys released G-Confid version 1.05.

Development of the Economic Disclosure Control and Dissemination System (EDCDS)

Following the recommendation of the Ad Hoc Task Force on Confidentiality and Dissemination for IBSP Surveys in the last fiscal year, it was decided to implement a centralized system for confidentiality and dissemination outside the framework of the IBSP but compatible with it. To that end, during the fiscal year SED and SISD built the EDCDS in order to meet many of the confidentiality and dissemination requirements of the IBSP surveys. The EDCDS Task Force was struck, with representatives from many survey areas and programs in Fields 5 and 8, as well as GenSys and BSMD representatives, and also the programmers from SED and SISD.

Progress:

GenSys identified the functionality to build into the EDCDS for use by IBSP surveys. GenSys identified to SED the data needs associated with inputs and outputs, recommended workaround solutions to G-Confid limitations that GenSys will overcome as part of Investment Plan 6-67000-4039, and provided further solutions such as looping between dependent tables. GenSys met extensively with the SED and SISD designers, exchanged many ideas via email, identified parameter defaults to use within the system, responded to many questions and investigated many scenarios. The extensive participation of GenSys helped to build a system with greater functionality than had been originally envisioned.

GenSys participated in the meetings of the EDCDS Task Force, and used the opportunity to identify best practices to the representatives of the various survey areas and programs. GenSys forged many new ties with the members of the Task Force, and strengthened ties with clients already supported by GenSys in their use of G-Confid.

Course on G-CONFID

In response to demand from clients in subject matter divisions as well as their methodologists, GenSys developed the course H-0435, “Disclosure Control Methods for Tabular Data using G-Confid.”

Progress:

The slides of the course on the use of G-Confid were improved, following recommendations from the report on the pilot version of the course that took place in the previous fiscal year.

The course was taught once in French and once in English. After each offering GenSys prepared a report assessing the offering, recommending changes to the content of slides or handouts, and changes to the exercises.

At the request of EAD, GenSys taught a customized one-day version of the course to EAD employees.

Banff System for Edit and Imputation

The Banff system for edit and imputation was developed to satisfy most of the edit and imputation requirements of economic surveys at Statistics Canada. It was first released in 2002 and operates in a SAS environment. Banff functions on all SAS platforms. The modules in Banff are independent of one another. Banff can be called with programs in SAS, with tasks in SAS Enterprise Guide (EG), or through metadata and the Banff Processor. Banff replaced its predecessor, GEIS, which was phased out in 2008.

In addition to regular maintenance of the system, this project requires some support for current and future (potential) applications. The methodologists of the Generalized Systems (GenSys) section act as consultants for other methodologists from various sectors who use Banff, or are considering using Banff. The support provided includes demonstrations of the system to project teams, initial meetings to determine if Banff can be a potential tool for a new or redesigned survey, and ongoing support to answer questions and provide general information. GenSys also maintains documentation such as the tutorial and the functional description in order to promote Banff and assist users. Furthermore, the GenSys methodologists are involved in the ongoing research and development of new features. They collaborate closely with the System Engineering Division (SED) to make sure the new functionality matches the users’ needs. The methodologists also participate in the testing of the new components of the system.

Progress:

Development and Evaluation

Banff 2.06 was released on May 16th, 2014 to include the Sigma-Gap (SG) method of outlier detection. The accompanying documentation included three examples of SG including a comparison to the Hidiroglou-Berthelot (HB) outlier detection method.

Banff 2.06 certification testing was completed using SAS 9.4 and EG 6.1. SAS 9.4 certification involved submitting many SAS test programs covering the nine procedures via a macro, then comparing the logs and outputs to those of the previous build of Banff 2.06, which used SAS 9.3. Two differences between the versions were written to a text file, investigated and resolved. EG 6.1 certification involved running the Banff tutorial in EG and comparing the results to those from EG 5.1.

GenSys continued to consult with the IBSP methodologists to learn how they want BY groups to work within the Banff Processor. Based on this consultation, GenSys drafted specifications to process BY groups according to IBSP’s preferences.

GenSys detailed how Banff minimizes the weighted number of fields requiring imputation using Proc Errorloc. Unlike the approach advocated by Fellegi and Holt (1976), which identifies the set of fields with the smallest product of the weights, Banff searches for the set with the smallest sum of weights. This discrepancy will be investigated.
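The following brute-force sketch illustrates the weighted error-localization principle: find the lowest-weight set of fields that can be changed so that a record passes all edits. It is an illustration only, with hypothetical fields, edits and candidate domains; Proc Errorloc uses a much more efficient search.

```python
# A brute-force sketch of weighted error localization: enumerate field sets,
# keep those that can be fixed, and return the one with the smallest sum of
# weights.  Not the Proc Errorloc algorithm; all names and data are invented.
from itertools import combinations, product

fields = ["employees", "payroll", "revenue"]
weights = {"employees": 1.0, "payroll": 2.0, "revenue": 2.0}
record = {"employees": 0, "payroll": 500, "revenue": 100}

# Edits that the record must satisfy (all must evaluate to True).
edits = [
    lambda r: r["payroll"] <= r["revenue"],              # payroll cannot exceed revenue
    lambda r: r["employees"] > 0 or r["payroll"] == 0,   # no payroll without employees
]

# Small candidate domains used only to test whether a field set *can* be fixed.
domains = {
    "employees": [0, 1, 5, 10],
    "payroll": [0, 100, 500],
    "revenue": [0, 100, 500, 1000],
}

def fixable(field_set):
    """True if some combination of candidate values for these fields passes all edits."""
    for values in product(*(domains[f] for f in field_set)):
        trial = {**record, **dict(zip(field_set, values))}
        if all(edit(trial) for edit in edits):
            return True
    return False

best_cost, best_set = float("inf"), None
for k in range(len(fields) + 1):
    for subset in combinations(fields, k):
        if fixable(subset) and sum(weights[f] for f in subset) < best_cost:
            best_cost, best_set = sum(weights[f] for f in subset), subset

print("fields to impute:", best_set, "total weight:", best_cost)
```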

Consultation and Support

IBSP (IBSP's methodologists required the majority of the time spent on user support during the past year)

Shuai Zhang from IBSP found two rare cases where Proc Prorate in Banff does not round properly. The error is at the smallest digit, plus or minus one, so it is not critical. GenSys ascertained that the rounding algorithm was correctly programmed within Banff and that the fault lay with the algorithm itself. GenSys sent specifications to SED to improve the rounding algorithm of Proc Prorate with two options: (i) a diagnostic to check that the rounding achieved the desired value, and (ii) the removal of the first of the three steps of the rounding algorithm.

Shuai also indicated that, even when specifying DECIMAL=9 (the number of decimals used in the rounding algorithm within Proc Prorate), Banff may not be able to prorate the data. The problem seems to occur when there is at least one value with many significant digits. SED indicated that the problem seems to be inherent to SAS. The investigation is ongoing; meanwhile, the workaround is to round values prior to running Proc Prorate.
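As background on prorating, the sketch below shows one simple way (a largest-remainder rule on non-negative values) to scale components to a control total and round them so that the rounded components still sum exactly to the rounded total. It is not Banff's Proc Prorate algorithm; the data and function name are illustrative.

```python
# A minimal sketch of prorating component values to a control total and
# rounding them so that the rounded components still add up exactly to the
# rounded control total (largest-remainder rule, non-negative values assumed).
def prorate_and_round(values, control_total, decimals=0):
    scale = 10 ** decimals
    factor = control_total / sum(values)
    scaled = [v * factor * scale for v in values]
    floored = [int(s // 1) for s in scaled]
    shortfall = int(round(control_total * scale)) - sum(floored)
    # give the remaining units to the components with the largest remainders
    order = sorted(range(len(values)), key=lambda i: scaled[i] - floored[i], reverse=True)
    for i in order[:shortfall]:
        floored[i] += 1
    return [f / scale for f in floored]

print(prorate_and_round([33.0, 33.0, 34.0], 100.0, decimals=0))  # [33.0, 33.0, 34.0]
print(prorate_and_round([1.0, 1.0, 1.0], 10.0, decimals=1))      # components sum to 10.0
```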

GenSys and SED continue to work to improve the Banff Processor. This work follows from points raised by IBSP and Joël Bissonnette in prior periods. One point came from Shuai Zhang, who wanted to know whether Banff had a bug, since Proc Donorimputation used an imputed non-respondent as a donor even with the option ELIGDON=Original (only original data can be donated). The answer is a matter of understanding how BY groups work in Banff.

GenSys compared the approach of using a HistRatio file with that of using a CurRatio file. GenSys concluded that the CurRatio approach uses imputed values for the ratio whereas the HistRatio approach does not, even though ExcludeImputed='N' was used for both approaches. Differences can occur when the CountCriteria is set and the Hist microdata file has fewer records than the CountCriteria in any given BY group.

GenSys provided assistance to several BSMD methodologists, including:

  • Lakshman Cooray, who sought advice on the use of Banff to impute values in the presence of nonresponse;
  • Samping Chen, whom GenSys showed how to relate variables in Proc Prorate;
  • Jeannine Morabito, to whom GenSys explained how Proc DonorImputation searches for a donor starting with the nearest; and
  • Serena Lu of BSMD, who was assisted with the development of edits for Proc Donorimputation.

GenSys indicated to Predrag Mizdrak that the implied edits that appear in the SAS log are not in a deliverable format for use. GenSys and SED plan to investigate the creation of a data file containing the implied edits, including identifying redundant edits, for the benefit of the user.

Documentation

The Banff Tutorial has been revised. It used to be a project in EG with over one hundred screen captures. The tutorial is now available only in SAS, for lower maintenance, and the guide is about 60% of its original size. The Tutorial has been greatly improved based on user feedback: there are more details explaining why certain values were used for the various parameters, new reports are produced to analyze the results, and comments were introduced throughout the program to better guide users. The Tutorial is also available for the Banff Processor.

After a comprehensive review of the Proc Outlier test program documentation, a few test programs were deleted due to redundancy with other test programs (same data, parameters, and results), and a few test programs that had previously tested only the HB method were revised to incorporate the SG method. There are presently 75 outlier test programs for Banff 2.06 compared to the 63 used for Banff 2.05.

Further information was included in the functional description, including the ordering of outliers on the OutStatus data file, among others. A new test program was included in the set of tests of Proc Outlier related to the ordering of outliers. GenSys provided SED with a translated point for its Banff User Guide about the order of warnings of dropped observations before outlier detection occurs.

A new test program for Proc Prorate was added which verifies that only one prorating edit group is allowed; if there are multiple prorating edits, they must be dependent or nested. GenSys also provided SED with a related update for its User Guide.

Based on a user support issue, we decided to add a Frequently Asked Questions (FAQ) entry to Banff’s website to explain that, even if the SEED parameter is specified, different results may occur from one run of a procedure to the next if the input data set differs between runs (i.e., either a different number of records or different values for some fields). Identical results will occur only with identical input data.

In response to another client inquiry, an FAQ was added to explain that it is not possible to put a KEEP or DROP statement in a Banff procedure, but the KEEP or DROP options are still possible.

While translating the Banff 2.06 tutorial from English to French and replacing the figures, it was noted that the French warnings in SAS 9.2 were not carried over to SAS 9.3. As a result, the French part of an FAQ regarding the length of edits had to be revised to reflect the now English-only warning.

The following four documents have been drafted: 1) a Banff Functions and Features sheet, 2) User Guidance on Proc Outlier Methods and Parameter Values, 3) a translated Tutorial (including variable names), and 4) a guide on using SAS Enterprise Guide 5.1 with Banff 2.06.

G-SAM

This project was initially started following work by the generalized systems planning committee. The mandate of the committee was to identify generalized system development needs and to develop a systems renewal plan as part of the 2010 LTP. After completing an inventory of statistical systems and determining the needs of the various surveys, the committee set priorities for development of generalized systems in sampling, estimation and imputation. In the case of sampling, the finding was that the tools available in Generalized Sampling (G-SAM) lack functionality and that the framework is outdated; updating them would require considerable effort and the result would not achieve the desired flexibility objectives. For this reason, the decision was made to develop new sampling tools focusing on flexibility and ease of use. The new system must offer all of the current G-SAM functionalities as well as those of StatMx (IBSP/PISE) and of household surveys.

To facilitate development of the specifications and the code, we chose a modular approach with modules for each of the main functionalities: allocation, stratification, selection and coordination. Each module will consist of a series of SAS macros and will be developed jointly with our SED/DIS partners. We will begin by developing the modules for allocation, stratification, selection and coordination from the list frames. We will then develop the specifications for the functionalities/modules corresponding to the area frames. Under the overall schedule proposed during planning, the list frame modules were to have been delivered (completed and certified) by May 2013; the area frame modules are to be developed once the first version has been delivered.

Progress:

Planning

Meetings with users are ongoing and we are keeping them informed of the development of the new tools. We are attending the Technical Committee of the Integrated Business Statistics Program (IBSP) and actively collaborating on development of the methodology for various sampling-related aspects.

Development

Certification of all modules was completed, including the module implementing the Lavallée-Hidiroglou optimum stratification method.

The first G-SAM v1.0.1 production version has been available since March 30, 2015.

LTP- New Estimation Tools

G-Est is the new generalized estimation system that is intended to be the successor to the current system, the Generalized Estimation System (GES), which was developed during the 1990s. Through a redesign of the architecture, the new system aims to carry over the core functions of GES, add new ones such as those found in StatMx and SEVANI, and facilitate the addition of new modules in the future. The G-Est project commenced following the work of the Generalized Systems Planning Committee. After making an inventory of the existing statistical systems and determining the potential needs of different surveys, development priorities were set. As the IBSP will be the primary user of the first release of G-Est, priorities in the first version of G-Est reflect functionality that it requires. In version 2 of G-Est, for which planning and discussions have already commenced, the focus will be on the requirements of household surveys.

G-Est consists of a set of SAS macros that can be combined to perform the necessary computations for a variety of designs and parameters. The design of the new system is intended to be a single application with independent modules that will allow a user to perform various design-based calculations by calling the appropriate modules. In addition to core functions such as parameter and variance estimation of totals, means, sizes, and ratios for one- and two-phase Simple Random Sample without Replacement (SRSWOR) designs, estimation for Poisson designs will also be available. The variance due to imputation system, SEVANI, has been integrated via a wrapper. The development of G-Est is implemented through the collaboration of BSMD and SED. Generalized Systems Methodologists provide specifications of the system to the developers in SED who will produce the final software.
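As a point of reference, the core design-based calculation for a single-phase SRSWOR design can be sketched as follows. This is illustrative Python only; G-Est itself is a set of SAS macros and handles many more designs and parameters.

```python
# A minimal sketch of design-based estimation under simple random sampling
# without replacement (SRSWOR): the estimated total and its estimated
# variance with the finite population correction.
import numpy as np

def srswor_total(y, N):
    """Estimated total Y_hat = N * ybar and its variance estimate
    N^2 * (1 - n/N) * s^2 / n for an SRSWOR sample of size n from N units."""
    y = np.asarray(y, dtype=float)
    n = y.size
    y_hat = N * y.mean()
    var_hat = N ** 2 * (1 - n / N) * y.var(ddof=1) / n
    return y_hat, var_hat

y_sample = [12.0, 7.0, 9.0, 15.0, 11.0]          # hypothetical sample values
total, variance = srswor_total(y_sample, N=200)
print(f"estimated total = {total:.1f}, standard error = {variance ** 0.5:.1f}")
```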

A beta release of G-Est was available in March 2014. Concurrently with the finalizing of version 1 of G-Est, development has begun on version 2, which will focus on the needs of household surveys. New functionality as a result of discussions with HSMD includes performing weight adjustments and deriving bootstrap weights. The project is scheduled to terminate in July 2015.

Progress:

Planning

The G-Est methodologists met regularly with a representative from HSMD to discuss requirements for G-Est version 2. The purpose of these meetings was to clarify details and resolve any issues regarding the implementation of the requirements.

A presentation of the upcoming version 2 features was given to HSMD senior methodologists and chiefs. The purpose of the presentation was to get initial thoughts about the new features and to address any concerns.

Development

G-Est methodologists performed extensive testing of the various versions of G-Est as part of ongoing test and fix cycles. The methodologists worked closely with SED programmers to resolve issues in the application.

G-Est methodologists provided technical support to methodologists using early versions of G-Est. Surveys that used or considered using G-Est include the IBSP, Balance of Payments (BOP), and the T2 Project.

Several maintenance builds of Sevani 3.0 were released. These builds contained various fixes, enhancements, and changes for consistency with Gest_Variance, the wrapper for Sevani.

Development of G-Est 2.0.

Based on the requirements gathered from HSMD, specifications and prototypes were written and provided to SED. These specifications concerned primarily the six new modules to be included in G-Est: nonresponse reweighting, influential weights detection and treatment, quantile estimation, generation of bootstrap weights, bootstrap variance estimation and bootstrap calibration.

Nonresponse reweighting: The nonresponse reweighting module reweights a set of input weights within response homogeneity groups (RHGs) based on the statuses of the units. The RHGs are provided by the user and assumed to suit his or her needs; that is, G-Est does not do the nonresponse modeling to derive the RHGs. The RHGs are defined by a list of variables provided by the user. For added flexibility, the module allows certain respondents to be excluded from reweighting. The module also performs reweighting for a set of bootstrap weights.
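A minimal sketch of the underlying weight adjustment, with hypothetical column names and data, is shown below: within each RHG, respondent weights are inflated so that they account for the total weight of all eligible units in the group. This illustrates the principle only; it is not the G-Est module.

```python
# A minimal sketch of nonresponse reweighting within response homogeneity
# groups (RHGs): respondent weights are multiplied by
# (sum of weights of all eligible units) / (sum of weights of respondents).
import pandas as pd

units = pd.DataFrame({
    "rhg":       ["A", "A", "A", "B", "B", "B", "B"],
    "weight":    [10.0, 12.0, 8.0, 5.0, 5.0, 6.0, 4.0],
    "responded": [True, False, True, True, True, False, False],
})

totals = units.groupby("rhg")["weight"].sum()
resp_totals = units[units["responded"]].groupby("rhg")["weight"].sum()
factor = totals / resp_totals                      # adjustment factor per RHG

units["adj_weight"] = 0.0                          # nonrespondents get weight zero
mask = units["responded"]
units.loc[mask, "adj_weight"] = units.loc[mask, "weight"] * units.loc[mask, "rhg"].map(factor)

print(units)
print("weight preserved per RHG:", units.groupby("rhg")["adj_weight"].sum().round(3).to_dict())
```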

Influential Weights Detection and Treatment: The influential weights module detects influential weights based on one of four available criteria that a user selects. Multiple variables can be used at once at the detection phase. Treatment of the influential weights is done within subgroups defined by the user. An influential weight is reduced by raising the weight to a fractional power, which is determined via an iterative process or from a list of user-specified exponents. The stopping criterion of the iterations is based on minimizing the mean squared error (MSE) of an estimate, where the MSE is computed using a working set of bootstrap weights or a naïve Taylor approach. Specific units can be excluded from processing so that the weight is not modified.
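The sketch below illustrates one part of this idea: choosing an exponent from a user-specified list by minimizing a naive MSE, taken here as the squared deviation from the untreated estimate plus a bootstrap variance. The data, the replicate weights and the MSE criterion are simplified assumptions for illustration; the module's actual detection criteria and stopping rules are richer.

```python
# A simplified illustration (not the G-Est module) of treating an influential
# weight by raising it to a fractional power chosen from a list of exponents.
import numpy as np

rng = np.random.default_rng(1)
y = np.array([4.0, 6.0, 5.0, 7.0, 250.0])     # the last unit reports an extreme value
w = np.array([10.0, 12.0, 11.0, 9.0, 40.0])   # and carries a large design weight
influential = 4                               # index of the unit flagged as influential
exponents = [1.0, 0.9, 0.75, 0.5]             # user-specified candidate exponents

# A crude working set of replicate weights, used here only to obtain a variance.
boot = w * rng.poisson(1.0, size=(500, w.size))

def total(weights):
    return float(np.sum(weights * y))

untreated = total(w)
best = None
for p in exponents:
    factor = (w[influential] ** p) / w[influential]   # weight reduction factor
    w_adj = w.copy()
    w_adj[influential] *= factor
    boot_adj = boot.copy()
    boot_adj[:, influential] = boot[:, influential] * factor
    estimate = total(w_adj)
    variance = np.var([total(b) for b in boot_adj], ddof=1)
    mse = (estimate - untreated) ** 2 + variance      # naive bias^2 + variance
    if best is None or mse < best[0]:
        best = (mse, p, estimate)

print(f"chosen exponent: {best[1]}, treated total: {best[2]:.1f}")
```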

Quantile Estimation: The quantile estimation module estimates quantiles and computes the corresponding variance and confidence intervals. The module produces survey-weighted estimates of quantiles within domains; interpolation is used where required. The variance and confidence interval estimation can be done using one of two methods: the bootstrap or the Woodruff method. With the bootstrap, the quantile module takes a set of user-supplied bootstrap weights and computes the bootstrap variance; the bootstrap confidence interval is calculated by determining the upper and lower limits of the bootstrap quantile estimates. With the Woodruff method, a confidence interval (of the image of the quantile) on the y-axis is built first, using bootstrap or Taylor methods, and is then projected onto the x-axis to obtain a confidence interval for the unknown quantile. This interval is then used to deduce the standard deviation (the half-interval width divided by the appropriate normal quantile).
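A minimal sketch of the Woodruff construction, with hypothetical data and a user-supplied standard error for the proportion, is given below; it illustrates the method and is not the G-Est quantile module.

```python
# An illustrative sketch of the Woodruff confidence interval for a
# survey-weighted quantile: build an interval for the proportion on the
# y-axis, then project it back through the weighted empirical distribution.
import numpy as np

def weighted_quantile(y, w, p):
    order = np.argsort(y)
    y, w = np.asarray(y)[order], np.asarray(w)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return np.interp(p, cdf, y)              # linear interpolation of the weighted cdf

def woodruff_ci(y, w, p, se_p, z=1.96):
    """se_p is an estimated standard error of the proportion below the
    quantile (e.g. from Taylor linearization or the bootstrap)."""
    lower_p, upper_p = max(p - z * se_p, 0.0), min(p + z * se_p, 1.0)
    lower = weighted_quantile(y, w, lower_p)
    upper = weighted_quantile(y, w, upper_p)
    se_quantile = (upper - lower) / (2 * z)   # the "half-interval" deduction
    return lower, upper, se_quantile

rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=0.6, size=400)   # hypothetical survey variable
w = rng.uniform(1.0, 3.0, size=400)                # hypothetical weights
print("median:", round(weighted_quantile(y, w, 0.5), 2))
print("Woodruff CI and SE:", [round(v, 2) for v in woodruff_ci(y, w, 0.5, se_p=0.03)])
```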

Bootstrap weights generation: The bootstrap module generates a user-specified number of bootstrap replicate weights and rescales them according to Rao, Wu and Yue (1992). The module offers parameters for deriving the bootstrap weights for stratified and/or clustered designs. Users can also set the offset for the rescaling factor and optionally specify a sampling fraction. A special mode for deriving the bootstrap weights using parallel processing via the SAS grid is expected to be available.
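The rescaling can be sketched for the common choice of resampling n_h - 1 PSUs with replacement within each stratum, in which case the Rao-Wu-Yue adjustment reduces to (n_h / (n_h - 1)) multiplied by the number of times a PSU is resampled. This is an illustration only; the parameterization of the G-Est module (offsets, sampling fractions, clustering) is more general.

```python
# A minimal sketch of Rao-Wu-Yue (1992) rescaled bootstrap weights for a
# stratified design with n_h - 1 PSUs resampled with replacement per stratum.
# Each stratum must contain at least two sampled PSUs.
import numpy as np

def rao_wu_yue_weights(strata, weights, n_replicates, rng=None):
    rng = rng or np.random.default_rng()
    strata = np.asarray(strata)
    weights = np.asarray(weights, dtype=float)
    boot = np.empty((n_replicates, weights.size))
    for b in range(n_replicates):
        adj = np.ones(weights.size)
        for h in np.unique(strata):
            idx = np.flatnonzero(strata == h)
            n_h = idx.size
            picks = rng.choice(idx, size=n_h - 1, replace=True)     # resample n_h - 1 PSUs
            counts = np.bincount(picks, minlength=weights.size)[idx]
            adj[idx] = (n_h / (n_h - 1)) * counts                   # Rao-Wu-Yue adjustment
        boot[b] = weights * adj
    return boot

strata = ["s1"] * 4 + ["s2"] * 3
w = [25.0, 25.0, 25.0, 25.0, 40.0, 40.0, 40.0]
replicates = rao_wu_yue_weights(strata, w, n_replicates=500, rng=np.random.default_rng(7))
print(replicates.shape, replicates.mean(axis=0).round(1))   # replicate means are close to the design weights
```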

Bootstrap variance estimation: A natural complement to the variance estimation module of G-Est 1.0 is the ability to use the bootstrap weights to estimate the variance of totals, means and ratios. In order to obtain a more efficient implementation, the Estimate_Batch module was rewritten entirely. We also want to consider options for running on the grid and for computing bootstrap confidence intervals.
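A self-contained sketch of the computation: apply the estimator to each set of replicate weights and take the mean squared deviation of the replicate estimates from the full-sample estimate. The replicate weights here come from a crude with-replacement resampling, purely for illustration; they stand in for the rescaled weights produced by the bootstrap module.

```python
# A minimal sketch of bootstrap variance estimation for a total.
import numpy as np

rng = np.random.default_rng(3)
y = np.array([3.0, 5.0, 4.0, 6.0, 10.0, 12.0, 9.0])     # hypothetical sample values
w = np.array([20.0, 20.0, 20.0, 15.0, 15.0, 10.0, 10.0])  # hypothetical design weights

n, B = y.size, 1000
# crude replicate weights: multiply each weight by how often the unit is resampled
counts = np.array([np.bincount(rng.integers(0, n, size=n), minlength=n) for _ in range(B)])
replicate_weights = w * counts

theta_full = float(np.sum(w * y))
theta_reps = replicate_weights @ y                        # one estimate per replicate
variance = float(np.mean((theta_reps - theta_full) ** 2))
print(f"full-sample total = {theta_full:.1f}, bootstrap variance = {variance:.1f}")
```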

Bootstrap calibration: A natural enhancement to the calibration module for G-Est 1.0 is the ability to calibrate the bootstrap weights for the various bootstrap replicates. In order to do so, the existing calibration had to be modified to accept possibly zero and negative entry weights. In addition, to obtain a more efficient implementation for some simple specific cases of great interest, the standard calibration approach that uses numerical optimization algorithms is bypassed altogether and the solution computed by direct means.
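For the linear (chi-square) distance, the direct solution can be sketched as follows: the calibrated weights are w_i (1 + x_i' lambda), with lambda obtained from a single linear system so that the weighted auxiliary totals match the controls. This is a generic illustration of calibration solved by direct means, not the G-Est calibration module; it also works when some entry weights are zero or negative, provided the weighted cross-product matrix is invertible.

```python
# A minimal sketch of linear (chi-square distance) calibration solved directly
# rather than by iterative optimization.  Data and controls are hypothetical.
import numpy as np

def linear_calibration(w, X, controls):
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)            # one row of auxiliary data per unit
    controls = np.asarray(controls, dtype=float)
    gap = controls - X.T @ w                  # distance from the control totals
    lam = np.linalg.solve(X.T @ (w[:, None] * X), gap)
    g = 1.0 + X @ lam                         # calibration factors
    return w * g

w = np.array([10.0, 12.0, 8.0, 9.0, 11.0])
X = np.array([[1, 2.0], [1, 3.0], [1, 1.0], [1, 4.0], [1, 2.5]])   # intercept + one auxiliary
controls = np.array([60.0, 130.0])            # known population count and total
w_cal = linear_calibration(w, X, controls)
print("calibrated totals:", (X.T @ w_cal).round(6))   # matches the controls exactly
```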

In addition to the new modules, several other enhancements were made to the calibration module. Specifications were drawn up for the enhancements and provided to SED. The enhancements include the ability to use the raking ratio distance and the use of auxiliary information at both the cluster and element levels without preprocessing.

After extensive mathematical analysis and evaluation of the existing module, a way to use the existing module to ensure calibrated household weights equal calibrated person weights was determined.

The specifications and the mathematical background of the new modules were presented to SED in two information sessions.

A utility macro was developed to connect the BOGS (Banff Output Generator for Sevani) to G-Est to handle special imputation scenarios. These scenarios are: (i) a variable is imputed, but no estimates of the variable are requested; (ii) a record used as a donor is later imputed; (iii) a record used as a donor is later set to inactive; (iv) a unit is imputed and is later set to inactive; and (v) units used as imputation contributors are later set to inactive, causing their imputation class to no longer contain sufficient contributors.

Specifications for new features of the BOGS were provided and SED implemented them. The improvements involve creating imputation weights for the imputation steps and ensuring that these weights are set to zero when a unit has been excluded from being a contributor (i.e., a potential donor in PROC DONORIMPUTATION or a unit used to calculate means and betas in PROC ESTIMATOR). This includes taking into account exclusion expressions, and rejecting negative values and outliers.

Generalized Systems: automated coding

This project covers all of the support, development, and research activities related to the generalized system for automated coding (G-Code). The work included the methodological support for the system.

The work included research and development into pre-processing methods. This work consists of two parts: studying the impact of the phonetically-based parsing algorithm on the linking results, compared to the SOUNDEX and NYSIIS methods currently available in G-Link; and studying the impact of the pre-processing methods currently available in G-Link on the coding results.

The first question addresses the relative performance of the newly built phonetic algorithm APhonex compared to existing algorithms such as Soundex, NYSIIS and Double Metaphone. It appears that APhonex is more accurate than any of those three algorithms but has a slightly lower coding rate.
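As background, the sketch below implements the classic Soundex code (with the special rule for H and W between consonants omitted for brevity) to show the kind of phonetic encoding being compared; it is not the APhonex algorithm, whose details are internal.

```python
# A minimal, simplified implementation of the classic Soundex phonetic code,
# shown only to illustrate the family of algorithms (Soundex, NYSIIS,
# Double Metaphone, APhonex) being compared.
SOUNDEX_MAP = {c: d for d, letters in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}

def soundex(name):
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    first = name[0]
    # encode every letter; vowels and H, W, Y get an empty code
    codes = [str(SOUNDEX_MAP.get(c, "")) for c in name]
    out = []
    prev = codes[0]                       # the first letter's code counts for collapsing
    for code in codes[1:]:
        if code and code != prev:
            out.append(code)
        prev = code                       # simplified: vowels, H, W and Y all break runs
    return (first + "".join(out) + "000")[:4]

for n in ["Robert", "Rupert", "Tymczak", "Jackson"]:
    print(n, soundex(n))                  # R163, R163, T522, J250
```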

The second objective was to propose a hybrid method based on detection/correction. The document describes a new method using both scoring algorithms available in G-Code version 2. Although the proposed algorithm enables us to code at least 20% more data without decreasing the precision of the coding, its limitation is the growing number of parameters: for one description, we can obtain several new descriptions depending on the number of words to be substituted.

One may also argue that the substitutions do not take into account the context of the description. In future work, it could be helpful to choose the substitutions based on some probability; Bayesian and machine learning approaches could also be investigated.

Progress:

Generalized Systems: macro suppress

The G-CONFID %Suppress macro may be unable to converge to a solution when the confidentiality problem is complex. Previous studies on the score function, the sequential approach and the HiTaS algorithm did not provide satisfactory results. More research is required to find a solution.

A preliminary document describing the various approaches examined previously to resolve the convergence problem has been prepared. It also discusses the problems encountered, advantages and disadvantages. The document is currently being edited by another individual.

The hypercube method defined by Giessing (2003), which is presently used in the European Tau-Argus system, was examined. A SAS program that reproduces the principles of this method was developed. Initial analyses indicate that a high number of suppressions is generated and that exact disclosures and cells that are not fully protected are sometimes encountered. Several variants have been added to improve the search process, notably an exhaustive search of all possible cube combinations between tables. A more detailed analysis will be performed in the next fiscal year, and a new approach involving replacing incomplete cubes with other cubes will likely be developed.

The HiTaS method was re-examined to understand why there were unprotected cells. A small data set was created to conduct this analysis. Following the analysis, an approach was developed and will be tested in the next fiscal year.

For further information, contact:
Karla Fox (613-851-8556, karla.fox@statcan.gc.ca).

References

Giessing, S. (2003). Co-ordination of Cell Suppressions: strategies for use of GHMITER.

Akpoué, B.P., Pelletier, C. and Yeung, A. (2012). Vers une complémentarité entre les algorithmes de Levenshtein et Soundex dans le codage automatisé, Statistics Canada, internal document.

Brouard, F. (2000). SQL Avancé : Programmation et techniques avancées, 2ème édition.

Rao, J.N.K., Wu, C.F.J. and Yue, K. (1992). Some recent work on resampling methods for complex surveys. Survey Methodology, vol. 18, no. 2, 209-217.

Taft, R.L. (1970). Name Search Techniques. New York State Identification and Intelligence System.

Philips, L. (1990). Hanging on the metaphone. Computer Language Magazine, 7, 12.

Statistics Canada (2015). Methodology Guide of G-Code version 2. Internal document.

Quality Assurance

Requests for consultation on quality assurance methods and automated coding, and opportunities for applying these methods, come from both internal and external clients. In addition to the general consultation, a research project was conducted on documenting the use of machine learning methods in official statistical agencies worldwide. The main focus areas of official statistics applications were automatic coding, editing and imputation, and record linkage. This report was used as preparatory reading by participants (senior officials from official statistical agencies) of the Workshop of the High-Level Group (HLG) Modernization Committee on Production and Methods, held in Geneva in April 2015. HLG here stands for the High-Level Group for the Modernization of Statistical Production and Services, which is affiliated with the United Nations.

Progress:

  • Consultations were done on exploring automated coding within the Electronic Questionnaire application for the Job Vacancy and Wages Survey.
  • There have been several requests for advice or training in quality assurance and quality control, but these clients have provided their own budget to cover the services provided.
  • The machine learning project is complete. The resulting report was reviewed by the Director of BSMD and the Director General of the Methodology Branch. It was subsequently distributed to the HLG on February 4, 2015. The paper commences with a short explanation of the differences between the two main classes of machine learning methods (supervised machine learning and unsupervised machine learning) and gives examples of classical statistical techniques that belong to these classes. It then gives a list of short descriptions of recent feasibility research projects and actual deployments of machine learning methods at various official statistical agencies, in the areas of automatic coding, editing and imputation, as well as certain other more boutique application areas. The report also explains why conventional supervised machine learning may not be directly applicable to record linkage. The contents of the report are based on input from statistical agencies and an academic literature review.

For further information, contact:
Pierre Daoust (613-854-2531, pierre.daoust@statcan.gc.ca).

Internal statistical training

Progress:

The Statistical Training Committee (CFS) coordinates the development and delivery of 27 regularly scheduled courses in survey methods, sampling theory and practice, questionnaire design, time series methods and statistical methods for data analysis. During the year 2014-2015, 37 scheduled sessions (100 days of training) were given, in either English or French. Note that the total number of training days increased by 4% compared to 2013-2014 (96 days of training). A total of 379 participants attended the courses (361 from Statistics Canada and 18 from outside).

The suite of courses continues to expand: the course “0460: An Introduction to Collection Methods: A Total Survey Error Perspective” was offered for the first time in 2014-2015. A special workshop, “H-0406C Strategies and Methods for Statistical Disclosure Control”, was offered in March 2015 and may be integrated into the regular list in the future. New courses on “Treatment of influential values in surveys” and on “Understanding Literature Review: Types, Methods and Practice” are currently being developed.

For more information, please contact:
François Gagnon (613-292-4645, francois.gagnon@statcan.gc.ca).

Symposium 2014

The 29th International Methodology Symposium, with the theme “Beyond traditional survey taking: adapting to a changing world”, took place from October 29 to 31, 2014, at the Palais des congrès in Gatineau. About 470 people attended the event.

The keynote address was given by Ray Chambers and the Waksberg Award was won by Constance F. Citro.

The organizing committee prepared a program with about 70 papers given by participants from over 15 countries. The committee also looked after the logistics for registration, operations and facilities management. The symposium’s proceedings should be published in fall 2015.

For more information, please contact:
Danielle Lebrasseur (613-854-1141, danielle.lebrasseur@statcan.gc.ca).

Survey Methodology Journal

Survey Methodology is an international journal that publishes articles in both official languages on various aspects of statistical development relevant to a statistical agency. Its editorial board includes world-renowned leaders in survey methods from the government, academic and private sectors. It is one of very few major journals in the world dealing with methodology for official statistics and is available for free at www.statcan.gc.ca/SurveyMethodology. The journal is released in fully accessible HTML format and in PDF.

Progress:

In 2014, the journal published its 40th volume. To mark this milestone and as part of a new communication plan, a special entry on Statistics Canada’s researcher’s blog was published. The new communication plan includes various activities to publicise the release of Survey Methodology including a paper in The Daily (Statistics Canada's official release bulletin), e-mails to registered and potential subscribers and mentions on social media. Other activities are in development for the future releases.

The June 2014 issue, SM 40-1, was released on June 27, 2014 and contained 8 papers.

The December 2014 issue, SM 40-2, was released on December 19, 2014. It contains 11 papers including the fourteenth paper in the annual invited series in honour of Joseph Waksberg. The recipient of the 2014 Waksberg Award is Constance Citro.

In 2014, the journal received 52 submissions of papers by various authors.

For further information, contact:
Susie Fortier (613-220-1948, susie.fortier@statcan.gc.ca).
