Social media as a data source for official statistics; the Dutch Consumer Confidence Index
Section 5. Discussion

For decades, national statistical institutes have relied on probability sampling in the production of official statistics. This approach rests on a sound theory for drawing valid statistical inference about large finite target populations from relatively small random samples. Over the last decades, more and more alternative data sources, such as administrative and big data, have become available, which raises the question of how to use them in the production of official statistics. An important issue is how results obtained with these sources can be generalized to an intended finite target population. Since the data generating process is generally unknown, it is not obvious how to draw valid inference from such data sources.

This paper addresses the question of how administrative and big data sources can be used in the production of official statistics. In the most extreme approach, survey data are replaced by related alternative data sources, at the risk of introducing, e.g., selection bias. Since most surveys are conducted repeatedly, a time series modelling approach is proposed to investigate to what extent related alternative data sources reflect an evolution similar to that of the series obtained with a repeated survey. With a multivariate state space model, the correlation between the underlying unobserved components of both series can be modelled. If components of the time series model are cointegrated, there are strong indications that both data sources are driven by the same underlying factor. This could be used as an argument that an alternative source can replace an existing survey, since both reflect the same evolution of a process, generally at a different level.

The theory underlying probability sampling for finite population inference is stronger than reliance on the concept of cointegration. Series obtained from social media or Google Trends are selected by maximizing the correlation with the series from the sample survey and do not necessarily measure the same concept as the survey. There is no guarantee that this correlation is based on true causality or that it will persist in the future. Sampling theory, in contrast, provides a rigorous mathematical framework showing that a correct sampling strategy, i.e., the right combination of a probability sample with an approximately design-unbiased estimator, results in valid statistical inference for intended target populations.

Even in the case of cointegrated series, an extensive model evaluation, e.g., by some form of cross validation, will be required to ensure that the alternative data source is a valid replacement. See in this context also Eichler (2013) for a discussion about the use of Granger causality for causal inference in multiple time series data. Instead of replacing a periodic survey, related data sources can be used in a multivariate time series modelling approach as auxiliary series to improve the precision of the direct estimates, or of the period-to-period change of the direct estimates, obtained with a periodic survey. Another important benefit of big data sources is their higher frequency, which can be used to make more precise early predictions or nowcasts when, in real time, the survey estimate is not yet available but the covariate already is. The time series model applied in this paper, initially proposed by Harvey and Chung (2000), is a generic approach for a model-based estimation procedure for periodic surveys. There are of course also issues with survey sampling. For example, continuously declining response rates and data collection modes that do not reach the intended target population also result in selection bias. In this case, cointegration with a related series derived from social media might be an indication that there are similarities between the selection bias in the non-probabilistic big data source and the non-response selection bias and coverage bias in a survey sample, as pointed out by Baker et al. (2013).

In the application to the CCI, the time series modelling approach does not decrease the variance of the direct estimator if it is used for making level estimates. The reason is that the standard error of the time series model reflects both the sampling error and the white noise of the population parameter, whereas the standard error of the direct estimator only reflects the sampling error. In the case of the CCI, the variance component of the white noise of the population parameter is as large as the variance of the sampling error. The state space approach is still useful for producing official figures of the CCI, since it filters a more stable trend of the respondents' opinions about the economic climate from the observed series of direct estimates. The situation, however, becomes different if the time series model is used to estimate month-to-month change. The stable trend estimates are the result of a strong positive correlation between the trend estimates of subsequent periods. As a result, the standard errors of month-to-month change obtained with the time series model are clearly smaller than those of the direct estimates. Standard errors of smoothed month-to-month changes are about 47% smaller than those of the direct estimates. Standard errors of the filtered estimates are about 17% smaller than the standard errors of the direct estimates.
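The role of the positive correlation between subsequent trend estimates follows from the variance of a difference, Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b). The minimal sketch below illustrates this relation. The function name, the standard errors and the correlations are hypothetical and are not taken from the CCI application.

```python
import math


def se_of_change(se_current: float, se_previous: float, corr: float) -> float:
    """Standard error of the difference between two correlated estimates."""
    variance = (
        se_current**2
        + se_previous**2
        - 2.0 * corr * se_current * se_previous
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Hypothetical inputs: a standard error of 1.0 in both months.
uncorrelated = se_of_change(1.0, 1.0, corr=0.0)  # e.g., independent estimates
correlated = se_of_change(1.0, 1.0, corr=0.8)    # strongly correlated estimates
print(f"uncorrelated: {uncorrelated:.3f}, correlated: {correlated:.3f}")
```

With equal standard errors in both periods, any positive correlation makes the standard error of the change smaller than in the uncorrelated case.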

Using the SMI as an auxiliary series in a bivariate state space model slightly reduces the standard error of the model estimates of the CCI. However, since the available series of the SMI is relatively short, the reduction obtained with this auxiliary series does not outweigh the loss of information in the CCI series observed in the period before the SMI became available. Nevertheless, since both series reflect a similar evolution and social media data are rapidly available, the SMI proved to be useful as an auxiliary series in the bivariate model to produce more reliable nowcasts for the CCI in real time, at the moment that the SMI is available but the CCI is not yet. In this application the SMI reduces the standard errors of the CCI in a nowcasting procedure by about 17%.
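The nowcasting idea can be illustrated with a deliberately simplified toy model. This is not the bivariate model (3.9) of this paper. The sketch assumes that both series observe one shared random-walk level with independent observation errors, that the auxiliary series has already been shifted and scaled to the level of the survey series, and that all variances are known. Under these assumptions the Kalman filter can process the available observations of a period one at a time and simply skip a missing one. All names and numbers are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def filter_common_level(
    survey: Sequence[Optional[float]],
    auxiliary: Sequence[Optional[float]],
    var_survey: float,
    var_aux: float,
    var_level: float,
    init_var: float = 1e6,
) -> List[Tuple[float, float]]:
    """Filtered mean and variance of a shared random-walk level."""
    level, var = 0.0, init_var  # large init_var approximates a diffuse start
    filtered = []
    for y_survey, y_aux in zip(survey, auxiliary):
        # Update with each available observation in turn; None means missing.
        for y, obs_var in ((y_survey, var_survey), (y_aux, var_aux)):
            if y is None:
                continue
            gain = var / (var + obs_var)
            level += gain * (y - level)
            var *= 1.0 - gain
        filtered.append((level, var))
        # Random-walk prediction step for the next period.
        var += var_level
    return filtered


# Hypothetical data: in the last period the survey estimate is not yet available.
survey = [10.0, 11.0, 12.0, None]
with_aux = filter_common_level(survey, [10.5, 11.2, 11.8, 12.5], 1.0, 2.0, 0.5)
without_aux = filter_common_level(survey, [10.5, 11.2, 11.8, None], 1.0, 2.0, 0.5)
print("nowcast variance with auxiliary series:   ", round(with_aux[-1][1], 3))
print("nowcast variance without auxiliary series:", round(without_aux[-1][1], 3))
```

In this toy setting the nowcast variance of the last period is smaller when the auxiliary observation is available. The size of the gain depends entirely on the assumed variances and says nothing about the reduction reported for the CCI.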

The question can be raised whether the SMI in its current operationalization measures the same concept as the CCI attempts to measure, and how the full potential of social media or other big data sources can be used to measure consumer confidence better than the current CCI and SMI do. Instead of constructing a social media index by taking the difference between positive and negative classified messages, an SMI could be constructed by looking at the concepts of the questions used for the CCI. If, for example, consumer confidence is measured by the amount of purchases of expensive goods during the last 12 months, or by the tendency of households to buy expensive goods, social media indices should be constructed that measure internet searches for such goods (cars, houses, white goods, etc.) as well as actual purchases of such goods during the previous months. The strong advantage of this approach is that actual behaviour of households is measured directly, whereas a survey measures it indirectly, which induces more measurement error. This might eventually result in cointegrated series that measure similar concepts and further improve or even replace the CCI.
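For reference, an index of the current type, based on the difference between positive and negative classified messages, could be computed along the following lines. This is a minimal sketch under stated assumptions: the difference is expressed as a percentage of all classified messages, and the label names and the function name are hypothetical rather than taken from the production system.

```python
from collections import Counter
from typing import Iterable


def sentiment_index(labels: Iterable[str]) -> float:
    """Percentage of positive minus percentage of negative messages."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified messages in this period")
    # Messages with any other label (e.g., neutral) only enter the denominator.
    return 100.0 * (counts["positive"] - counts["negative"]) / total


# Hypothetical classified messages for one period.
messages = ["positive", "positive", "negative", "neutral"]
print(sentiment_index(messages))
```

An index built on the CCI concepts, as suggested above, would replace the sentiment labels with indicators of searches for and purchases of expensive goods.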

Acknowledgements

The authors are grateful to the Associate Editor and the reviewers for careful reading of a former draft of this paper and providing constructive comments, which significantly improved the content of this paper. The views expressed in this paper are those of the authors and do not necessarily reflect the policies of Statistics Netherlands.

Appendix A

Model diagnostics

Table A.1
Univariate model (3.8) for CCI 172-24 obs
Diagnostic	Value	$p$-value	95% conf. int. (L)	95% conf. int. (U)
Log-likelihood	-464			
Mean std. innovations	0.0152			
Variance std. innovations	1.0851			
Skewness std. innovations	0.0276			
Kurtosis std. innovations	2.8901			
Bowman-Shenton test^{1} on normality in the std. innovations	0.0926	0.955		
Ljung-Box test^{2} on serial correlation in std. innovations	24.108	0.287		
Durbin-Watson test^{3} on serial correlation of std. innovations $(T = 148)$	2.082		1.68	2.32
$F$-test^{4} on heteroscedasticity of std. innovations $(df_{num} = df_{denom} = 60)$	0.913		0.60	1.67
Note 1: Bowman-Shenton statistic: $\chi_{2}^{2}$ distribution.
Note 2: Ljung-Box test statistic for serial correlation in the first 24 lags: $\chi_{21}^{2}$ distribution.
Note 3: Durbin-Watson test statistic approximated with $N(2, 4/T)$.
Note 4: $F$-statistic: $F_{df_{denom}}^{df_{num}}$ distribution.
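The normality diagnostic in Table A.1 can be checked roughly from the reported moments. The sketch below assumes the standard form of the Bowman-Shenton statistic, $T \{S^{2}/6 + (K - 3)^{2}/24\}$ with $S$ the skewness and $K$ the kurtosis, and takes $T = 148$ from the Durbin-Watson row. Neither the formula nor the choice of $T$ is stated in the table, so both are assumptions. The $p$-value uses the fact that the survival function of a $\chi_{2}^{2}$ distribution is $\exp(-x/2)$.

```python
import math
from typing import Tuple


def bowman_shenton(n_obs: int, skewness: float, kurtosis: float) -> Tuple[float, float]:
    """Bowman-Shenton normality statistic and its chi-square(2) p-value."""
    stat = n_obs * (skewness**2 / 6.0 + (kurtosis - 3.0) ** 2 / 24.0)
    p_value = math.exp(-stat / 2.0)  # exact survival function for 2 d.f.
    return stat, p_value


# Rounded moments as reported in Table A.1; T = 148 is an assumption.
stat, p_value = bowman_shenton(148, 0.0276, 2.8901)
print(f"statistic: {stat:.4f}, p-value: {p_value:.3f}")
```

Because the inputs are rounded, the result agrees with the tabulated statistic only approximately.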

Table A.2
Bivariate model (3.9) for CCI 57-24 obs
Diagnostic	Value	$p$-value	95% conf. int. (L)	95% conf. int. (U)
Log-likelihood	-230			
Mean std. innovations	-0.0872			
Variance std. innovations	0.9777			
Skewness std. innovations	0.0982			
Kurtosis std. innovations	2.5450			
Bowman-Shenton test^{1} on normality in the std. innovations	0.3382	0.844		
Ljung-Box test^{2} on serial correlation in std. innovations	18.060	0.645		
Durbin-Watson test^{3} on serial correlation of std. innovations $(T = 33)$	2.133		1.32	2.68
$F$-test^{4} on heteroscedasticity of std. innovations $(df_{num} = df_{denom} = 15)$	0.783		0.35	2.86
Note 1: Bowman-Shenton statistic: $\chi_{2}^{2}$ distribution.
Note 2: Ljung-Box test statistic for serial correlation in the first 24 lags: $\chi_{21}^{2}$ distribution.
Note 3: Durbin-Watson test statistic approximated with $N(2, 4/T)$.
Note 4: $F$-statistic: $F_{df_{denom}}^{df_{num}}$ distribution.

Table A.3
Bivariate model (3.9) for SMI 57-12 obs
Diagnostic	Value	$p$-value	95% conf. int. (L)	95% conf. int. (U)
Log-likelihood	-230			
Mean std. innovations	0.0954			
Variance std. innovations	1.0437			
Skewness std. innovations	-0.1311			
Kurtosis std. innovations	2.5331			
Bowman-Shenton test^{1} on normality in the std. innovations	0.5377	0.764		
Ljung-Box test^{2} on serial correlation in std. innovations	24.208	0.283		
Durbin-Watson test^{3} on serial correlation of std. innovations $(T = 45)$	2.028		1.42	2.58
$F$-test^{4} on heteroscedasticity of std. innovations $(df_{num} = df_{denom} = 20)$	0.329		0.41	2.46
Note 1: Bowman-Shenton statistic: $\chi_{2}^{2}$ distribution.
Note 2: Ljung-Box test statistic for serial correlation in the first 24 lags: $\chi_{21}^{2}$ distribution.
Note 3: Durbin-Watson test statistic approximated with $N(2, 4/T)$.
Note 4: $F$-statistic: $F_{df_{denom}}^{df_{num}}$ distribution.
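The 95% confidence limits reported for the Durbin-Watson statistic in Tables A.1 to A.3 follow from the normal approximation $N(2, 4/T)$ given in Note 3. The short sketch below recomputes them, assuming a two-sided interval with the usual normal quantile 1.96. The function name is illustrative.

```python
import math
from typing import Tuple


def durbin_watson_interval(n_obs: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate acceptance interval for the Durbin-Watson statistic."""
    half_width = z * math.sqrt(4.0 / n_obs)  # z times the std. dev. of N(2, 4/T)
    return 2.0 - half_width, 2.0 + half_width


# T as reported in Tables A.1, A.2 and A.3.
for n_obs in (148, 33, 45):
    lower, upper = durbin_watson_interval(n_obs)
    print(f"T = {n_obs}: [{lower:.2f}, {upper:.2f}]")
```

A reported statistic inside the interval gives no indication of first-order serial correlation in the standardized innovations.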

References

Baker, R., Brick, J.M., Bates, N.A., Battaglia, M., Couper, M.P., Dever, J.A., Gile, K.J. and Tourangeau, R. (2013). Summary report of the AAPOR task force on non-probability sampling. Journal of Survey Statistics and Methodology, 1, 90-143, first published online September 26, 2013, doi:10.1093/jssam/smt008.

Bell, W.R. (2005). Some considerations of seasonal adjustment variances. Census Bureau. Paper available at https://www.census.gov/ts/papers/jsm2005wrb.pdf.

Bell, W.R., and Hillmer, S.C. (1990). The time series approach to estimation for repeated surveys. Survey Methodology, 16, 2, 195-215. Paper available at http://www.statcan.gc.ca/pub/12-001-x/1990002/article/14535-eng.pdf.

Binder, D.A., and Dick, J.P. (1989). Modelling and estimation for repeated surveys. Survey Methodology, 15, 1, 29-45. Paper available at http://www.statcan.gc.ca/pub/12-001-x/1989001/article/14579-eng.pdf.

Binder, D.A., and Dick, J.P. (1990). A method for the analysis of seasonal ARIMA models. Survey Methodology, 16, 2, 239-253. Paper available at http://www.statcan.gc.ca/pub/12-001-x/1990002/article/14533-eng.pdf.

Blight, B.J.N., and Scott, A.J. (1973). A stochastic model for repeated surveys. Journal of the Royal Statistical Society, Series B, 35, 61-66.

Blumenstock, J., Cadamuro, G. and On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350, 1073-1076.

Bollineni-Balabay, O., van den Brakel, J.A. and Palm, F. (2015). Multivariate state-space approach to variance reduction in series with level and variance breaks due to sampling redesigns. Accepted for publication in Journal of the Royal Statistical Society, Series A.

Bollineni-Balabay, O., van den Brakel, J.A. and Palm, F. (2017). State space time series modelling of the Dutch Labour Force Survey: Model selection and mean squared errors estimation. Survey Methodology, 43, 1, 41-67. Paper available at http://www.statcan.gc.ca/pub/12-001-x/2017001/article/14819-eng.pdf.

Bowley, A.L. (1926). Measurement of the precision attained in sampling. Bulletin de l’Institut International de Statistique, 22, Supplement to Book 1, 6-62.

Buelens, B., Burger, J. and van den Brakel, J.A. (2015). Predictive inference for non-probability samples: A simulation study. Discussion paper 2015-13, Statistics Netherlands, Heerlen.

Cochran, W. (1977). Sampling Techniques. New York: John Wiley & Sons, Inc.

Daas, P., and Puts, M. (2014a). Big data as a source of statistical information. The Survey Statistician, 69, 22-31.

Daas, P., and Puts, M. (2014b). Social media sentiment and consumer confidence. European Central Bank Statistics paper series No. 5, Frankfurt Germany.

Doornik, J.A. (2009). An Object-oriented Matrix Programming Language Ox 6. London: Timberlake Consultants Press.

Durbin, J., and Koopman, S.J. (2012). Time Series Analysis by State Space Methods, Second Edition. Oxford: Oxford University Press.

Eichler, M. (2013). Causal inference with multiple time series: Principles and problems. Philosophical Transactions of the Royal Society A, 371, issue 1997.

Feder, M. (2001). Time series analysis of repeated surveys: The state-space approach. Statistica Neerlandica, 55, 182-199.

Hansen, M.H., and Hurwitz, W.N. (1943). On the theory of sampling from finite populations. Annals of Mathematical Statistics, 14, 333-362.

Harvey, A.C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge: Cambridge University Press.

Harvey, A.C., and Chung, C.H. (2000). Estimating the underlying change in unemployment in the UK. Journal of the Royal Statistical Society, Series A, 163, 303-339.

Koopman, S.J. (1997). Exact initial Kalman filtering and smoothing for non-stationary time series models. Journal of the American Statistical Association, 92, 1630-1638.

Koopman, S.J., Shephard, N. and Doornik, J.A. (2008). SsfPack 3.0: Statistical Algorithms for Models in State Space Form, London: Timberlake Consultants Press.

Koopman, S.J., Harvey, A., Shephard, N. and Doornik, J.A. (2009). STAMP 8.2, London: Timberlake Consultants Press.

Lind, J.T. (2005). Repeated surveys and the Kalman filter. Econometrics Journal, 8, 418-427.

Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F., Pedreschi, D., Rinzivillo, S., Pappalardo, L. and Gabrielli, L. (2015). Small area model-based estimators using big data sources. Journal of Official Statistics, 31, 263-281.

Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97, 558-625.

Pang, B., and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2, 1-135.

Pfeffermann, D. (1991). Estimation and seasonal adjustment of population means using data from repeated surveys. Journal of Business & Economic Statistics, 9, 163-175.

Pfeffermann, D., and Burck, L. (1990). Robust small area estimation combining time series and cross-sectional data. Survey Methodology, 16, 2, 217-237. Paper available at http://www.statcan.gc.ca/pub/12-001-x/1990002/article/14534-eng.pdf.

Pfeffermann, D., and Rubin-Bleuer, S. (1993). Robust joint modelling of labour force series of small areas. Survey Methodology, 19, 2, 149-163. Paper available at http://www.statcan.gc.ca/pub/12-001-x/1993002/article/14458-eng.pdf.

Pfeffermann, D., and Sverchkov, M. (2014). Estimation of mean squared error of X-11-ARIMA and other estimators of time series components. Journal of Official Statistics, 30, 811-838.

Pfeffermann, D., and Tiller, R. (2006). Small area estimation with state space models subject to benchmark constraints. Journal of the American Statistical Association, 101, 1387-1397.

Pfeffermann, D., Feder, M. and Signorelli, D. (1998). Estimation of autocorrelations of survey errors with application to trend estimation in small areas. Journal of Business & Economic Statistics, 16, 339-348.

Rao, J.N.K., and Yu, M. (1994). Small area estimation by combining time series and cross-sectional data. Canadian Journal of Statistics, 22, 511-528.

Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer-Verlag.

Scott, A.J., and Smith, T.M.F. (1974). Analysis of repeated surveys using time series methods. Journal of the American Statistical Association, 69, 674-678.

Scott, A.J., Smith, T.M.F. and Jones, R.G. (1977). The application of time series methods to the analysis of repeated surveys. International Statistical Review/Revue Internationale de Statistique, 45, 13-28.

Tam, S.-M. (1987). Analysis of repeated surveys using a dynamic linear model. International Statistical Review/Revue Internationale de Statistique, 55, 1, 63-73.

Tiller, R.B. (1992). Time series modelling of sample survey data from the U.S. current population survey. Journal of Official Statistics, 8, 149-166.

van den Brakel, J.A., and Krieg, S. (2009). Estimation of the monthly unemployment rate through structural time series modelling in a rotating panel design. Survey Methodology, 35, 2, 177-190. Paper available at http://www.statcan.gc.ca/pub/12-001-x/2009002/article/11040-eng.pdf.

van den Brakel, J.A., and Krieg, S. (2015). Dealing with small sample sizes, rotation group bias and discontinuities in a rotating panel design. Survey Methodology, 41, 2, 267-296. Paper available at http://www.statcan.gc.ca/pub/12-001-x/2015002/article/14231-eng.pdf.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2017-12-21
