Generating smart deep files: the example of synthesizing hierarchical data
The Government of Canada’s Directive on Open Government aims to ensure that Canadians have greater access to government data and information. One open data solution is smart synthetic files, which retain as much analytical value as possible while addressing the confidentiality issues that arise from collecting personal information. In recent years, Statistics Canada has acquired recognized expertise in producing synthetic data files of high analytical value. In a current project, Statistics Canada is tackling a new challenge: synthesizing a database while preserving hierarchical structures in the form of families, where records are linked and share common traits that must be maintained. The same challenges arise when synthesizing other structured data, such as business data. This paper presents the challenges of, and solutions for, building synthetic data with such hierarchical structures. The strategy is illustrated with the development of a synthetic database that supports the development of retirement income policies. This database includes over 20 variables and 8 million records structured into approximately 4 million family units. We will present how family structures have been preserved, discuss the practical and technical challenges inherent in developing such a large and complex database, present the risk and utility of the data, and propose avenues for future research.
Keywords: synthetic data of high analytical value; family structures; modern data access solution.
Release date: October 29, 2021