All time references are in CEST
Alternative data sources and pseudo-inference in official statistics

Session Organisers | Dr Federico Crescenzi (University of Tuscia); Professor Tiziana Laureti (University of Tuscia)
Time | Tuesday 18 July, 09:00 - 10:30
Room |
Official statistics are based on survey data collected by National Statistical Offices, typically through probabilistic samples. However, the release of statistics on socio-economic phenomena such as poverty and living conditions, or of price statistics (e.g. inflation), may suffer from poor timeliness or may lack the accuracy needed to provide reliable estimates for small areas. In the field of inflation measurement, for example, alternative data sources such as web-scraped data have proven to be valuable tools for producing high-frequency estimates of inflation. In recent years, web-scraped data has become an important source for the compilation of consumer price indices (CPIs) in many countries, driven by the growing prevalence of online sales and advances in the methods used to process such data (European Commission, 2020). Likewise, remote sensing data have proven useful for predicting poverty in small areas.
Nonetheless, web-scraped data and similar sources are collected through non-probabilistic mechanisms, so they can suffer from selection bias and from the lack of consistent estimators of sampling errors. At the same time, such data open the possibility of more nuanced statistical modelling techniques based on statistical/machine learning.
The goal of this session is to show the advantages that alternative data sources can offer in the field of official statistics and to address the drawbacks that come with non-probabilistic samples. The session welcomes contributions on topics such as:
- small area estimation of official statistics based on alternative data sources (e.g. remote sensing, web-scraping)
- web scraping and crowdsourcing data to release/integrate official statistics
- non-probabilistic inference in official statistics
- survey-to-survey imputation using alternative data sources and machine learning techniques
- statistical/machine learning techniques in official statistics
Professor Pier Luigi Conti (Sapienza University of Rome)
Professor Daniela Marella (Sapienza University of Rome) - Presenting Author
Accounting for a nonignorable selection mechanism in non-probability samples is a major undertaking, and the present paper attempts to address this challenge. In probability sampling, every unit in a finite population has a known non-zero probability of being selected. By contrast, non-probability sampling involves some form of arbitrary selection of units into the sample, so that inclusion probabilities are unknown. Hence, it is not possible to apply the standard design-based approach to make inference about finite population parameters. Furthermore, the unknown selection mechanism is frequently selective with respect to the target population, so estimates of population characteristics may be subject to serious selection bias.
As a major effect of the uncontrolled selection mechanism, unless very restrictive assumptions are made, the distribution of the character of interest is unidentifiable. The present paper proposes a new approach based on the analysis of the uncertainty due to the non-identifiability of the distribution of the variate of interest Y. Suppose that we have two samples A and B, where B is a large non-probability sample and A is an independent probability sample. First of all, the class of plausible distributions for the variable of interest is defined on the basis of the available sample data. Next, an uncertainty measure quantifying the size of the aforementioned class is introduced, an estimator is defined and its asymptotic properties are analyzed.
Extra-sample information on the sampling design and/or on the distribution of the variable of interest, when available, reduces this uncertainty. In both cases, estimators of the uncertainty measure are defined and their asymptotic properties are discussed. Finally, a plausible estimate of the distribution of Y is constructed and its accuracy is evaluated.
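The uncertainty measure and estimators proposed in the paper are not reproduced here. As a rough, self-contained illustration of the identification problem they address, the sketch below computes Manski-type worst-case bounds for the mean of a bounded variable Y observed only in a non-probability sample B, with an independent probability sample A used solely to estimate B's coverage; all data and names (y_B, in_B_flag_A) are simulated and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data:
# B: large non-probability sample of Y (selection mechanism unknown)
# A: independent probability sample recording, for each unit, whether it is
#    covered by the non-probability source B.
y_B = rng.beta(2, 5, size=50_000)        # observed outcomes, bounded in [0, 1]
in_B_flag_A = rng.random(1_000) < 0.6    # coverage indicator observed in sample A

y_min, y_max = 0.0, 1.0                  # known logical bounds for Y

# Estimated coverage of the non-probability source, from the probability sample
p_hat = in_B_flag_A.mean()

# Worst-case bounds for E[Y]:
# E[Y] = E[Y | covered] * p + E[Y | not covered] * (1 - p),
# where the second conditional mean is only known to lie in [y_min, y_max].
mean_B = y_B.mean()
lower = mean_B * p_hat + y_min * (1 - p_hat)
upper = mean_B * p_hat + y_max * (1 - p_hat)

# The width of this identification interval is a crude measure of the
# uncertainty induced by the unknown selection mechanism.
print(f"estimated coverage p_hat = {p_hat:.3f}")
print(f"bounds for E[Y]: [{lower:.3f}, {upper:.3f}], width = {upper - lower:.3f}")
```

The interval width shrinks towards zero as the coverage of B approaches one, mirroring the paper's point that extra-sample information on the selection mechanism reduces the uncertainty due to non-identifiability.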
Dr Luigi Palumbo (Bank of Italy)
Dr Niccolò Salvini (Università Cattolica del Sacro Cuore)
Dr Tiziana Laureti (Università della Tuscia) - Presenting Author
This study introduces a novel approach to analyzing food price inflation by utilizing web-scraped data from major Italian retail chains. It examines the phenomenon of "cheapflation", whereby the prices of lower-cost food products (LCFP) increase more rapidly than those of more expensive alternatives, across both temporal and spatial dimensions of the Italian food market. The analysis is framed within a stochastic price index approach, employing Time-interaction-Region Product Dummy (TiRPD) models to examine price variations over time and space.
Our findings confirm the occurrence of "cheapflation" during the analysis period, highlighting the potential of web-scraped data to enhance traditional Consumer Price Index methodologies and provide real-time insights into inflationary pressures. This innovative use of big data provides valuable evidence for policy makers seeking to address economic disparities and the impacts of inflation on vulnerable populations.
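The TiRPD specification used in the study is not reproduced here. As a minimal sketch of the product-dummy idea it builds on, the code below regresses log prices on region-period interaction dummies and product dummies for simulated web-scraped quotes (all variable names and data are hypothetical) and reads regional price indices off the exponentiated region-period coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical web-scraped price quotes
n = 5_000
df = pd.DataFrame({
    "product": rng.choice([f"item_{i}" for i in range(50)], size=n),
    "region":  rng.choice(["North", "Centre", "South"], size=n),
    "period":  rng.integers(0, 6, size=n),              # months 0..5
})
# Simulate log prices with product effects and region-specific inflation
prod_eff = {p: rng.normal(0, 0.5) for p in df["product"].unique()}
df["log_price"] = (
    df["product"].map(prod_eff)
    + 0.01 * df["period"] * (df["region"] == "South")   # faster inflation in one region
    + 0.005 * df["period"]
    + rng.normal(0, 0.05, size=n)
)

# Region-period interaction dummies (the "TiR" part) plus product dummies ("PD")
df["region_period"] = df["region"] + "_" + df["period"].astype(str)
fit = smf.ols("log_price ~ C(region_period) + C(product)", data=df).fit()

# Price index for each region-period, relative to the omitted base category:
# exponentiate the estimated region-period coefficients.
idx = {
    name.split("[T.")[1].rstrip("]"): np.exp(coef)
    for name, coef in fit.params.items()
    if name.startswith("C(region_period)")
}
print(pd.Series(idx).sort_index().round(3))
```

An analysis along the lines of the abstract would additionally split products by price segment and compare the resulting indices, so that faster index growth in the cheaper segment would be the cheapflation signature.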
Dr Camilla Salvatore (Utrecht University) - Presenting Author
Dr Angelo Moretti (Utrecht University)
Climate change is a global problem that has a significant impact on the world’s economy and society. To effectively address climate change, policymakers require reliable estimates of relevant indicators measuring attitudes towards climate change at a sub-national level, given that these attitudes vary geographically. Measuring public attitudes towards climate change is crucial in order to investigate collective action towards sustainable practices. However, nationally representative sample surveys collecting variables on these phenomena, e.g. the European Social Survey (ESS), are not usually designed to produce accurate and precise estimates at the sub-national level. In this work, we propose to use small area estimation techniques to obtain reliable estimates of attitudes towards climate change at the regional level based on the ESS. The key idea of small area estimation models is to “borrow strength” from other areas and from auxiliary information based on administrative data or the Census, in order to improve the survey-based estimates. In recent years, the integration of digital trace data (e.g. from websites, social media, Google Trends) with survey data has gained importance. A novel aspect of our approach is that we include non-traditional auxiliary information, specifically web data, in our model. Our results demonstrate that incorporating web data, in some cases, yields more reliable estimates than the model without them. The results are assessed and discussed via model selection and diagnostics. Finally, we also acknowledge and address certain limitations associated with the use of web data in small area estimation.
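The authors' model is not reproduced here. As a minimal sketch of the area-level approach typically used in this setting, the code below fits a basic Fay-Herriot model by hand on simulated data, with x_web standing in for a web-based auxiliary covariate (all names and data are hypothetical), and shrinks the direct survey estimates towards the model-based synthetic predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical area-level data
D = 40                                    # number of regions
x_web = rng.normal(0, 1, size=D)          # web-based covariate (e.g. a search index)
X = np.column_stack([np.ones(D), x_web])  # design matrix with intercept
psi = rng.uniform(0.01, 0.05, size=D)     # known sampling variances of direct estimates

beta_true, sigma_u2_true = np.array([0.5, 0.2]), 0.02
theta = X @ beta_true + rng.normal(0, np.sqrt(sigma_u2_true), size=D)  # true area means
theta_direct = theta + rng.normal(0, np.sqrt(psi))                     # direct survey estimates

# Fay-Herriot model: theta_direct_d = x_d' beta + u_d + e_d
# 1) Prasad-Rao moment estimator of the area-level variance sigma_u^2
beta_ols, *_ = np.linalg.lstsq(X, theta_direct, rcond=None)
resid = theta_direct - X @ beta_ols
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
sigma_u2 = max(0.0, (resid @ resid - np.sum(psi * (1 - np.diag(H)))) / (D - X.shape[1]))

# 2) GLS estimate of beta given sigma_u^2
w = 1.0 / (sigma_u2 + psi)
beta_gls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * theta_direct))

# 3) EBLUP: shrink each direct estimate towards the synthetic prediction
gamma = sigma_u2 / (sigma_u2 + psi)
eblup = gamma * theta_direct + (1 - gamma) * (X @ beta_gls)

print(f"estimated sigma_u^2 = {sigma_u2:.4f}")
print("mean squared error, direct:", np.mean((theta_direct - theta) ** 2).round(5))
print("mean squared error, EBLUP :", np.mean((eblup - theta) ** 2).round(5))
```

Areas with noisier direct estimates (larger psi) are shrunk more strongly towards the synthetic prediction, which is how the model "borrows strength" from the other areas and from the auxiliary web covariate.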
Professor Nina Deliu (University of Cambridge) - Presenting Author
Professor Brunero Liseo (Sapienza University of Rome)
Survey sampling and, more generally, Official Statistics are experiencing an important period of renovation. On one hand, there is a need to exploit the huge informational potential that the digital revolution has made available in terms of data. On the other hand, this process has coincided with a progressive deterioration in the quality of classical sample surveys, due to a decreasing willingness to participate and an increasing rate of missing responses. The switch from survey-based inference to a hybrid system involving register-based information has sharpened the debate on, and the search for a resolution of, the design-based versus model-based controversy.
Among the main consequences of this paradigm shift is the need for robust methods to quantify uncertainty in the data production process. In this new framework, statistical techniques that provide exact coverage guarantees for model-based procedures are essential. However, common uncertainty quantification methods offer coverage guarantees that hold only in large samples, leaving the problem open for the production of official predictions in settings with small sample sizes, such as small domains or small area estimation.
In this work, we explore and discuss a general method that guarantees exact coverage of prediction intervals even for finite sample sizes: the so-called Conformal Prediction (CP) framework. CP is a relatively recent method for the construction of prediction sets with guaranteed error rates, relying exclusively on the exchangeability of the observations in the sample. While its use is very popular in modern statistics, CP is not yet common in survey statistics. We argue that this approach can be beneficial in both design-based and model-based approaches to survey sampling.
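As a concrete illustration of the CP mechanics (not of its application to survey data discussed in the talk), the following is a minimal split-conformal sketch on simulated data, using a linear model and absolute residuals as nonconformity scores; all names and data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Hypothetical auxiliary variables and outcome
n = 2_000
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(0, 0.4, size=n)

# Split conformal prediction: fit on one half, calibrate residuals on the other
train, calib = np.arange(0, n // 2), np.arange(n // 2, n)
model = LinearRegression().fit(X[train], y[train])

alpha = 0.10                                         # target 90% coverage
scores = np.abs(y[calib] - model.predict(X[calib]))  # nonconformity scores
rank = int(np.ceil((len(calib) + 1) * (1 - alpha)))  # finite-sample quantile rank
q_hat = np.sort(scores)[min(rank, len(calib)) - 1]

# Prediction interval for a new unit: under exchangeability, this covers the
# true value with probability at least 1 - alpha for any finite sample size.
x_new = rng.normal(size=(1, 3))
pred = model.predict(x_new)[0]
print(f"90% conformal interval: [{pred - q_hat:.3f}, {pred + q_hat:.3f}]")
```

The guarantee is marginal and distribution-free: it relies only on the exchangeability of the calibration and test observations, not on the linear model being correctly specified.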