All time references are in CEST
Pitfalls in data integration
Session Organisers | Mr Philip Adebahr (Chemnitz University of Technology), Mrs Sandra Jaworeck (Chemnitz University of Technology)
Time | Tuesday 18 July, 16:00 - 17:00
Room | U6-06
During the COVID-19 pandemic, researchers carried out a large number of studies and generated the corresponding data. Data collected independently of each other cannot simply be analyzed together. Data integration offers the opportunity to connect and jointly analyze such data, generating insights that help us better understand what happened, for example, during the COVID-19 crisis. Despite this promise, data integration entails many pitfalls, because it combines different statistical tools such as weighting, (multiple) imputation, data fusion, and data harmonization. For harmonization, we combine validity and reliability checks as well as (multiple) equating. While the methods and tools themselves have been discussed at length, the limitations of their interplay have received little attention.
Consider an example: equating is based on representative data. How do weighting procedures influence equating? To what extent are weighted, harmonized data suitable for further analysis? How does this influence our results? In this session, we will discuss the pitfalls of combining statistical methods of data integration, their interplay and order of implementation, their influence on our analyses and results, as well as possible solutions.
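As a minimal illustration of this interplay, consider linear equating, which maps scores x onto the scale of y via eq(x) = mu_y + (sd_y / sd_x) * (x - mu_x). If the means and standard deviations entering this formula are estimated with survey weights, the equating function itself shifts. The following Python sketch uses invented data and weights and is only meant to make the question concrete, not to prescribe a procedure:

# Linear equating with and without survey weights (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=2_000)    # scores on the source scale
y = rng.normal(50.0, 10.0, size=2_000)  # scores on the target scale
w = np.where(x > 3.0, 2.0, 1.0)         # hypothetical weights, e.g. to correct nonresponse

def linear_equate(x, y, wx=None, wy=None):
    # Map x-scale scores onto the y scale using (weighted) moments.
    mu_x, mu_y = np.average(x, weights=wx), np.average(y, weights=wy)
    sd_x = np.sqrt(np.average((x - mu_x) ** 2, weights=wx))
    sd_y = np.sqrt(np.average((y - mu_y) ** 2, weights=wy))
    return lambda s: mu_y + sd_y / sd_x * (s - mu_x)

unweighted = linear_equate(x, y)
weighted = linear_equate(x, y, wx=w)
print(unweighted(4.0), weighted(4.0))   # the two mappings differ

Because the weights here depend on the score itself, the weighted moments, and hence the equating line, differ from the unweighted ones; which variant is appropriate for subsequent analysis is exactly the kind of question this session addresses.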
Contributions need not necessarily deal with COVID-19 to be accepted in this session.
Keywords: harmonization, equating, imputation, weighting, data fusion, data integration
Mr Manuel Holz (Chemnitz University of Technology) - Presenting Author
Professor Jochen Mayerl (Chemnitz University of Technology)
Professor Peter Kriwy (Chemnitz University of Technology)
The study is part of a project that investigates the relationship between COVID-19 infection activity, its reflection in social media, and attitudes towards specific pandemic measures in a longitudinal manner. For this purpose, we integrate survey, epidemiological, and social media data. Survey data stem from several waves of large representative panel studies, in our case the GESIS Panel, the SOEP, and the MCS. Integrating these data sets enables us to obtain a potentially time-variant measure of attitudes (here, towards face mask wearing). Since item scales and patterns of missing data vary across studies, the data harmonization process (scale transformation, imputation) requires special attention. Further, we use infection activity data from the European Centre for Disease Prevention and Control (ECDC) and Tweets with relevant pandemic content gathered via a web scraping framework. Matching social media and epidemiological data is less prone to scale complications, since matching occurs merely on time period. Yet defining the time frame (single vs. aggregated time points) has the potential to influence results. Accounting for these challenges, we investigate how the dimensionality of items, the definition of time frames, and weighting affect model effect sizes and data fit.
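A small illustration of the time-frame issue described above (column names and data are invented, not the project's actual pipeline): matching a survey wave to the incidence of a single day versus an aggregated window yields different covariate values, which can propagate into effect sizes.

# Matching survey waves to daily incidence: single day vs. 14-day window.
import pandas as pd

ecdc = pd.DataFrame({  # hypothetical daily incidence series
    "date": pd.date_range("2020-10-01", periods=60, freq="D"),
    "cases_per_100k": range(60),
})
waves = pd.DataFrame({  # hypothetical fieldwork start dates of two waves
    "wave": [1, 2],
    "fieldwork_start": pd.to_datetime(["2020-10-15", "2020-11-20"]),
})

def incidence_window(start, days):
    # Mean incidence over the `days` days preceding fieldwork start.
    mask = (ecdc["date"] >= start - pd.Timedelta(days=days)) & (ecdc["date"] < start)
    return ecdc.loc[mask, "cases_per_100k"].mean()

waves["incidence_1d"] = waves["fieldwork_start"].map(lambda d: incidence_window(d, 1))
waves["incidence_14d"] = waves["fieldwork_start"].map(lambda d: incidence_window(d, 14))
print(waves)  # the matched covariate differs with the definition of the time frame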
Dr Ranjit Singh (GESIS - Leibniz Institute for the Social Sciences)
Dr Lydia Repke (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Many research and infrastructure projects in the social sciences pool survey data from different sources, measured in different contexts, or fielded in different survey modes. Such projects often have to assess the comparability of different measurements before merging different source variables into a homogeneous target variable.
When researchers combine different variables into one, they often overlook one aspect of comparability: reliability. Ideally, different data sources should have comparable reliabilities; that is, they should be subject to comparable levels of random error. Establishing comparable reliabilities is challenging because many survey instruments consist of only one item, which makes estimating reliabilities more cumbersome than for the multi-item instruments common in psychometrics.
We propose two novel approaches to assessing the comparability of the reliability of two single-item instruments measuring political interest and compare them with test-retest reliability. The first approach relies on the open-access tool Survey Quality Predictor (SQP). Drawing on a meta-analysis of a large pool of methodological experiments, SQP predicts the measurement quality of an instrument for a continuous latent variable from the instrument's characteristics. This approach offers a cost-effective way for ex-post harmonization projects to assess the comparability of reliabilities.
The second approach is what we term comparative attenuation. To assess whether two instruments measure the same concept, it is often desirable to correlate both with other, related concepts. We demonstrate that the same setup can be used to infer the relative reliability of the two instruments: if both instruments are conceptually comparable but one is less reliable than the other, a very specific pattern of intercorrelations with the validation concepts arises (illustrated in the sketch below).
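To see where such a pattern comes from, recall the classical attenuation formula: the observed correlation between an instrument and a validation variable equals the true correlation multiplied by the square root of the instrument's reliability. Hence, if A and B tap the same concept, the ratio r(A,V) / r(B,V) should be roughly constant across validation variables V, namely about sqrt(rel_A / rel_B). The following simulation sketch is our reconstruction of that logic under assumed reliabilities, not the authors' code:

# Comparative attenuation: two single-item measures of one latent trait.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
trait = rng.normal(size=n)        # latent political interest
rel_a, rel_b = 0.9, 0.6           # assumed reliabilities of instruments A and B

# Observed scores with variance 1 and squared correlation with the trait
# equal to the reliability (classical test theory parameterization).
a = np.sqrt(rel_a) * trait + np.sqrt(1 - rel_a) * rng.normal(size=n)
b = np.sqrt(rel_b) * trait + np.sqrt(1 - rel_b) * rng.normal(size=n)

for true_r in (0.2, 0.4, 0.6):    # validation concepts of varying relevance
    v = true_r * trait + np.sqrt(1 - true_r**2) * rng.normal(size=n)
    r_av = np.corrcoef(a, v)[0, 1]
    r_bv = np.corrcoef(b, v)[0, 1]
    print(f"true r={true_r:.1f}  r(A,V)={r_av:.3f}  r(B,V)={r_bv:.3f}  "
          f"ratio={r_av / r_bv:.3f}")  # ratio ~ sqrt(0.9 / 0.6) = 1.22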
In an empirical proof-of-principle study, we apply both methods alongside a conventional but more costly test-retest reliability estimation and illustrate this with two different survey questions capturing political interest.
Dr Dimitri Prandner (Johannes Kepler University of Linz | School of Social Sciences) - Presenting Author
Professor Johann Bacher (Johannes Kepler University of Linz | Department of empirical social research)
Testing complex stochastic models with survey data often requires a large set of specific variables to operationalize theoretical constructs. This limits the reuse potential of multi-topic social surveys, as they seldom include all the variables necessary for such an analysis. At the same time, survey research is challenged by rising costs, declining response rates, and limited resources for administering high-quality studies, so scientists are often unable to field questionnaires tailored to a specific research question.
Accordingly, our paper explores the potential of data fusion based on multiple imputation, using shared variables from core modules to fuse datasets and donate missing information from one dataset to another.
In theory, fusing surveys that feature shared core questions as well as specialized elements designed with fusion in mind could help create data that allow for analyzing complex models. Such fused data would go beyond the limitations of single surveys (e.g., restricted length and the related decision to cover only a few topics or provide a limited overview) and avoid the problem of missing variables.
Yet two conditions must be met for the fusion to produce useful results:
(1.) The statistical model used for data fusion must have sufficient explanatory power.
(2.) The assumption of local stochastic independence must hold: the variables Z shared between the datasets explain the survey-specific variables X of survey 1 and Y of survey 2, so that X and Y are conditionally independent given Z (see the sketch after this list).
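A minimal sketch of such a fusion step under these two conditions (variable names are invented; a single regression imputation is shown for brevity, whereas the paper relies on multiple imputation to reflect uncertainty):

# Statistical matching: donate Y from survey 2 to survey 1 via shared Z.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
z_cols = ["age", "education", "political_interest"]  # hypothetical shared Z

survey1 = pd.DataFrame(rng.normal(size=(500, 3)), columns=z_cols)
survey2 = pd.DataFrame(rng.normal(size=(500, 3)), columns=z_cols)
survey2["y_trust"] = 0.8 * survey2["political_interest"] + rng.normal(size=500)

# Condition (1): the fusion model must explain Y well enough from Z.
model = LinearRegression().fit(survey2[z_cols], survey2["y_trust"])
survey1["y_trust_imputed"] = model.predict(survey1[z_cols])

# Condition (2): any association between a survey-specific X in survey 1
# and the imputed Y runs entirely through Z (conditional independence).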
The presentation will discuss different methods of fusing data, the assumptions above, their practical applications, and their limitations. Case studies include, e.g., the 2016 Austrian presidential elections, where some relevant data were only present in the Austrian Social Survey (SSÖ) 2016, while additional key variables were only available in the ESS of the same year.