ESRA 2023 Glance Program


All time references are in CEST

Multi-source data for labour statistics

Session Organisers Professor Roberta Varriale (Sapienza University of Rome)
Professor Dimitris Pavlopoulos (Vrije Universiteit Amsterdam (VU))
Time Tuesday 18 July, 09:00 - 10:30
Room

In recent years, labour market researchers and National Statistical Institutes involved in producing labour market statistics have increasingly adopted the use of multiple data sources. In addition to survey data, representing a traditional and primary source of information, register data from various sources has become extensively utilized in labour market research and, to some extent, in the production of official statistics.
A novel and promising development in this field is the production of multi-source statistics, achieved by linking information from various independent data sources. This approach offers numerous opportunities, as it combines the strengths of different frameworks: the extensive and detailed objective information provided by register data and the substantively valuable subjective insights obtained through surveys.
From a methodological perspective, linking multiple data sources presents the opportunity to address several aspects of data quality. By drawing on multiple sources of information, it is possible to enhance final estimates both in terms of content and accuracy. Nevertheless, this approach also introduces new methodological challenges.
In this session, we aim to present a range of statistical methods for handling multi-source data, accompanied by examples of results obtained using these methods to demonstrate the advantages of multi-source statistics. In particular, we will highlight the use of latent variable models, which leverage the simultaneous availability of information from multiple sources. This approach offers the benefit of accounting for potential measurement errors in each individual source. The application context for these methods is labour statistics.

Keywords: multi-source data, latent variable models, labour statistics

Papers

Modeling Total Error using Multi-source Data: A Simulation and an Application to the Italian Labor Market

Mr Santiago Gómez-Echeverry (Vrije Universiteit Amsterdam) - Presenting Author
Mrs Silvia Loriga (Istituto Nazionale di Statistica (ISTAT))
Mr Davide Di Laurea (Istituto Nazionale di Statistica (ISTAT))
Mr Arnout van Delden (Centraal Bureau voor de Statistiek (CBS))

The expansion of administrative and Big Data and the increase in survey nonresponse have highlighted the relevance of assessing the quality of non-probability samples. To tackle this issue, researchers usually resort to the Total Error (TE) framework, which divides the error into a measurement component and a representation component. Extensive literature focuses on measurement error, often combining data from different sources to evaluate whether the observed variables adequately capture the concept they are intended to measure. Another branch of the literature has centered on representation error, assessing how the selection of respondents into the sample leads to systematic differences between the population and the observed units. However, research modeling both components simultaneously is still scant. In the present study, we address this gap by jointly modeling the measurement and representation errors, combining recent advances in both areas. We conducted a simulation study to evaluate our TE model under different measurement and representation error specifications. Additionally, we performed a case study using a combination of Italian administrative registers and the Labor Force Survey (LFS) to evaluate the TE in the income variable. Our preliminary results show that our model adequately captures the different error sources and provides a good strategy for assessing the TE when using a combination of probability and non-probability data.


Revealing Tax-benefit Social Preferences in Croatia: Considering the Impact of Direct and Indirect Taxes

Mr Marko Ledic (EIZG) - Presenting Author
Mr Ivica Rubil (EIZG)

This paper employs the inverse-optimal approach from Saez’s (2002) optimal income tax model to derive the implicit marginal social welfare weights of single-earner households in Croatia. We compare tax-benefit revealed social preferences in different settings, depending on whether the labor supply elasticities and the inverse-optimal tax model consider only direct taxes, both direct and indirect taxes, or a combination of the two. Considering both direct and indirect taxes is crucial, as they can affect choices between leisure and consumption and, ultimately, the total effective tax burden. We obtained the income distribution from the Croatian component of the European Union Statistics on Income and Living Conditions (EU-SILC) 2018, prepared as the input data for the EU tax-benefit microsimulation model (EUROMOD). Since the EU-SILC does not contain data on expenditures, we matched data from the Household Budget Survey (HBS) to the EU-SILC using the Predictive Mean Matching imputation method. Net direct and indirect taxes are calculated using EUROMOD and an indirect tax microsimulation model, respectively, while behavioral responses at the intensive and extensive margin are estimated using a static discrete-choice labor supply model for Croatia. As top incomes tend to be under-represented in survey data, making survey income distributions unrepresentative of the true ones, we used tax records from the Croatian Tax Administration to correct the EU-SILC data. We find that the 2017 tax-benefit system is optimal only if the government assigns a much higher welfare weight to the workless poor than to the working poor. This holds when only direct taxes are considered and is further strengthened when both direct and indirect taxes are considered.
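The matching step mentioned above can be illustrated with a deliberately simplified sketch of predictive mean matching. It is not the authors' implementation: it assumes a single predictor (income), a plain least-squares line, and a deterministic nearest-donor rule, whereas standard PMM usually draws at random among several close donors. The function names and the toy figures are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(donor_x, donor_y, recipient_x):
    """Impute y for recipients by borrowing the observed y of the donor
    whose predicted value is closest to the recipient's predicted value."""
    intercept, slope = fit_line(donor_x, donor_y)
    donor_pred = [intercept + slope * x for x in donor_x]
    imputed = []
    for x in recipient_x:
        pred = intercept + slope * x
        nearest = min(range(len(donor_pred)),
                      key=lambda i: abs(donor_pred[i] - pred))
        # An observed donor value is copied, never the model prediction.
        imputed.append(donor_y[nearest])
    return imputed


# Hypothetical toy data: donors have income and expenditure (as in a budget
# survey); recipients have income only (as in an income survey).
donor_income = [10, 20, 30, 40]
donor_expenditure = [8, 15, 27, 31]
recipient_income = [12, 33]
imputed_expenditure = pmm_impute(donor_income, donor_expenditure,
                                 recipient_income)
```

Because imputed values are always taken from observed donors, the imputed expenditures stay within the range of values actually seen in the donor survey.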


Using Machine Learning for Improving Nonresponse Adjustment in the Current Population Survey

Dr Emanuel Ben-David (US Census Bureau) - Presenting Author

The response rates to the Current Population Survey (CPS) have declined recently, raising concerns about potential nonresponse bias in key labor statistics. In this paper, we discuss using administrative data to adjust the weights for nonresponse while keeping the calibration of population estimates unchanged. This involves linking the administrative data to both responding and nonresponding households in the survey. Once linked, the data can be used to adjust respondents' weights to account for differential nonresponse rates among subpopulations. The paper has two main aims. The first is to enhance nonresponse adjustment using more advanced machine learning models. The second is to address potential errors in the linkage process, which can significantly affect the performance of the models used for nonresponse adjustment.
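The basic idea of reweighting respondents for differential nonresponse can be sketched with a classical weighting-class adjustment. This is a simpler stand-in for the machine learning models the abstract refers to, not the method used for the CPS. The cell labels, which would come from linked administrative data, and the weights are hypothetical.

```python
from collections import defaultdict


def adjust_weights(units):
    """Weighting-class nonresponse adjustment.

    units: list of dicts with keys 'weight' (base weight), 'cell'
    (adjustment cell) and 'responded' (bool).
    Returns the respondents only, with weights inflated by the inverse of
    the weighted response rate in their cell.
    """
    total = defaultdict(float)      # base weight of all sampled units
    responding = defaultdict(float)  # base weight of respondents
    for u in units:
        total[u["cell"]] += u["weight"]
        if u["responded"]:
            responding[u["cell"]] += u["weight"]

    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total[u["cell"]] / responding[u["cell"]]
            adjusted.append({**u, "weight": u["weight"] * factor})
    # Note: a cell with no respondents contributes nothing here, so in
    # practice such cells would need to be collapsed with others first.
    return adjusted


# Hypothetical sample with two adjustment cells.
sample = [
    {"weight": 100.0, "cell": "A", "responded": True},
    {"weight": 100.0, "cell": "A", "responded": False},
    {"weight": 50.0, "cell": "B", "responded": True},
    {"weight": 50.0, "cell": "B", "responded": True},
    {"weight": 50.0, "cell": "B", "responded": True},
    {"weight": 50.0, "cell": "B", "responded": False},
]
adjusted_sample = adjust_weights(sample)
```

Within each cell, the adjusted respondent weights add up to the base weights of all sampled units in that cell. A model-based approach would replace the cell response rate with an estimated response propensity for each unit.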


Predicting Unit Nonresponse from Multiple Sources of Administrative Data with Parametric Regression and Machine Learning Methods

Dr Hafsteinn Einarsson (University of Iceland) - Presenting Author
Professor Joseph Sakshaug (Ludwig Maximilian University of Munich / University of Mannheim / Institute for Employment Research)

Register-based sample surveys, where data from administrative register systems are drawn to form sample frames, are used in many European countries. However, the range of register-based variables utilized in survey research is often limited to a few demographic characteristics. In recent years, the use of a wider range of register data for research purposes has become more commonplace, although the full potential of the register system remains untapped, particularly for national statistical institutes that can access numerous sources of register-based administrative data. Here, we examine how a greater choice of administrative register data variables can affect survey practice as it relates to labor force surveys, by exploring associations between individual level characteristics drawn from multiple administrative data sources and unit nonresponse in the Icelandic Labor Force Survey, a quarterly cross-sectional telephone survey. Specifically, we focus on whether prior wave information can be utilized to predict unit nonresponse prior to the onset of fieldwork and whether expanding the range of administrative variables improves prediction. Furthermore, we explore whether the choice of estimator affects accuracy by comparing parametric regression and machine learning methods. Our findings suggest that due to the strong association between immigration background and survey participation, most models show similar performance in terms of classifying respondents and nonrespondents. However, when comparing the goodness-of-fit across the full range of the response propensity distribution, we find that the combination of an extensive range of administrative variables and Random Forest models performs best in terms of predicting unit nonresponse. We explore the relative contribution of the predictor variables for …. We consider how these findings could affect survey practice in repeated labor force surveys, including how they may be used in informing adaptive survey designs.


Early Career Patterns: A Comparative Analysis of Education-to-Work Transitions in the Netherlands and Italy

Ms Silvia Loriga (Istat)
Ms Laura Eberlein (VU University Amsterdam) - Presenting Author

Youth employment is a topic of great interest in labor market analysis and is one of the areas where significant differences are observed among European countries. Specifically, Italy has one of the lowest youth employment rates, while the Netherlands boasts one of the highest. In addition to analyzing the proportion of young people employed at a given point in time, it is also interesting to observe transitions, in terms of entry into and exit from employment, as well as changes in the type of work.
This study aims to examine young people's entry into employment and the career paths during the early years following the completion of their studies, comparing the results for Italy and the Netherlands. These aspects are usually studied through ad hoc sample surveys on the career outcomes of young people who have completed their studies. Typically, the information collected from these surveys allows for the analysis of the employment status of young people a certain number of years after leaving the education system. However, it does not enable an analysis of all the transitions that occurred over time.
To carry out this study, we constructed a rich database by linking administrative information with data collected from the Labour Force Survey. The database has a longitudinal dimension, achieved by leveraging the ability to integrate administrative data over time and the longitudinal nature of the Labour Force Survey. We used a Mixture Hidden Markov Model, which provides a suitable framework for analyzing discrete-time longitudinal data with multiple life statuses and a large number of different transitions. Based on the characteristics of the work, we identify different types of employment and recognize distinct patterns that differ in their upward and downward mobility.


A multiple-group hidden Markov model for multi-source data. Cross-country differences in employment mobility in the presence of measurement error

Dr Roberta Varriale (Sapienza University)
Dr Mauricio Garnier-Villarreal (Vrije Universiteit Amsterdam)
Dr Dimitris Pavlopoulos (Vrije Universiteit Amsterdam) - Presenting Author
Dr Danila Filipponi (Italian National Institute of Statistics)

In this paper, we study whether measurement error in survey and administrative data biases cross-country differences in employment mobility. For this purpose, we develop a multigroup hidden Markov model and apply it to linked data from the Labour Force Survey and administrative sources from the Netherlands and Italy for the years 2017-2019. The measurement error correction we apply with our model reconciles differences between data sources and shows that cross-country differences in employment mobility are smaller than originally thought. Error-corrected estimates indicate that mobility from temporary to permanent employment has become, over time, larger in Italy than in the Netherlands, while mobility from non-employment to temporary employment has steadily been higher in the Netherlands than in Italy. The paper illustrates the value of using multiple data sources to produce reliable estimates of key socioeconomic indicators.
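To illustrate how a hidden Markov model can combine a survey indicator and a register indicator of the same latent status, the likelihood of one person's observed sequence can be computed with the forward algorithm. The sketch below is not the authors' multigroup model. It covers a single group and assumes that the indicators are independent given the latent state. The function name and all parameter values are hypothetical.

```python
def forward_likelihood(init, trans, emissions, observations):
    """Likelihood of an observed sequence under a hidden Markov model.

    init[s]            : probability of starting in latent state s
    trans[s][r]        : probability of moving from state s to state r
    emissions[j][s][y] : probability that indicator j reports y in state s
    observations       : list over time of tuples (y_1, ..., y_J);
                         None marks a missing indicator at that time point
    """
    n_states = len(init)

    def emit(state, obs):
        # Indicators are assumed conditionally independent given the state;
        # a missing indicator simply contributes a factor of 1.
        prob = 1.0
        for j, y in enumerate(obs):
            if y is not None:
                prob *= emissions[j][state][y]
        return prob

    alpha = [init[s] * emit(s, observations[0]) for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[s] * trans[s][r] for s in range(n_states)) * emit(r, obs)
            for r in range(n_states)
        ]
    # Unscaled recursion: fine for short panels, but long sequences would
    # need scaling or log-probabilities to avoid numerical underflow.
    return sum(alpha)


# Hypothetical two-state example (0 = not employed, 1 = employed).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
survey = [[0.95, 0.05], [0.10, 0.90]]    # survey misclassification
register = [[0.98, 0.02], [0.05, 0.95]]  # register misclassification
sequence = [(0, 0), (0, None), (1, 1)]   # survey missing at the second wave
likelihood = forward_likelihood(init, trans, [survey, register], sequence)
```

In a multigroup setting, the same recursion would be evaluated with group-specific parameters, for instance one parameter set per country.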


Hidden Markov models: accuracy across structured missing data designs

Dr Mauricio Garnier-Villarreal (Vrije Universiteit Amsterdam)
Dr Roberta Varriale (Sapienza Università di Roma)
Dr Danila Filipponi (Istituto nazionale di statistica) - Presenting Author

Large-scale data sets are helpful for generating representative national statistics, but no data set is free from measurement error. Latent variable methods, such as Hidden Markov Models (HMM), can be used to correct for some forms of measurement error. One way to do this is to add multiple indicators, for example one from a register and another from a survey. A commonly used survey is the Labour Force Survey (LFS), but this type of survey presents missing data issues: subjects are not included at every time point, which results in a structured missing data design. In this research, we use a simulation study to test the accuracy and stability of HMM with structured missing data like that in the LFS. The simulation conditions are inspired by the use of register and LFS data at Istat (Istituto nazionale di statistica) for the evaluation of work status (Employed/Unemployed). We vary the following data conditions: missing data type (MCAR and structured), proportion of missing data (from 0 to 0.8), sample size (from 500 to 5000), and item quality. Item quality has four categories, defined by good/bad items and by which item presents the missing data. With this study we will be able to evaluate the predictive accuracy of HMM for categorical variables measured over time with realistic missing data structures, providing applied researchers with guidelines about the proper use of HMM and about when it will tend to present higher classification error.
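The two missing data types compared in the simulation can be sketched as follows. This is only an illustration of the data-generating idea, not the authors' simulation code. The rotation pattern (two periods in the panel, two out, two in) is an assumption chosen to resemble an LFS-style design, and all function names and parameter values are hypothetical.

```python
import random


def simulate_chain(n_periods, p_stay, rng):
    """Simulate a two-state latent chain (1 = employed, 0 = unemployed)."""
    state = rng.randint(0, 1)
    path = [state]
    for _ in range(n_periods - 1):
        if rng.random() >= p_stay:
            state = 1 - state
        path.append(state)
    return path


def observe(path, error_rate, rng):
    """Noisy indicator: each true state is flipped with prob. error_rate."""
    return [1 - s if rng.random() < error_rate else s for s in path]


def mask_mcar(values, p_missing, rng):
    """Missing completely at random: each value is dropped independently."""
    return [None if rng.random() < p_missing else v for v in values]


def mask_rotation(values, start,
                  pattern=(True, True, False, False, True, True)):
    """Structured missingness: a value is kept only in periods when the
    unit is in the panel, following the rotation pattern from 'start'."""
    masked = []
    for t, v in enumerate(values):
        k = t - start
        in_panel = 0 <= k < len(pattern) and pattern[k]
        masked.append(v if in_panel else None)
    return masked


rng = random.Random(2023)
latent = simulate_chain(8, p_stay=0.9, rng=rng)
register = observe(latent, error_rate=0.05, rng=rng)  # observed every period
survey_full = observe(latent, error_rate=0.02, rng=rng)
survey_structured = mask_rotation(survey_full, start=1)
survey_mcar = mask_mcar(survey_full, p_missing=0.5, rng=rng)
```

Under the structured design, the positions of the missing values are fixed by the entry period. Under MCAR, they vary at random from unit to unit.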