ESRA 2023 Glance Program


All time references are in CEST

New Developments in Using, Sharing, and Re-using Metadata

Session Organisers Mr Knut Wenzig (DIW Berlin/SOEP)
Mr Daniel Bela (LIfBi)
Dr Arne Bethmann (SHARE Germany and SHARE Berlin Institute)
Time Tuesday 18 July, 09:00 - 10:30
Room

Metadata systems have evolved from passive documentation tools into active drivers of data management and utilization. This session explores recent advancements that enhance the use, sharing, and re-use of metadata across the data lifecycle, emphasizing innovative methods that improve data quality, interoperability, and efficiency.

With machine-readable metadata, processes like survey instrument generation, data validation, and preparation are increasingly automated, reducing errors and enhancing data-driven decision-making. Metadata systems are becoming essential components in not just documenting data, but actively shaping and streamlining the entire data lifecycle.

We invite papers that highlight:

- Innovative Uses: Examples of how metadata systems are leveraged for automation and optimization in data collection, processing, and analysis.
- Interoperability: Experiences with implementing metadata standards (e.g., DDI, SDMX) to facilitate sharing and re-use across different systems and institutions.
- Collaborative Platforms: Case studies on platforms that support community-driven creation, sharing, and re-use of metadata.
- FAIR Principles: Approaches that ensure metadata adheres to the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
- Future Directions: Emerging technologies, such as AI and machine learning, that could revolutionize metadata use.

This session aims to provide a comprehensive overview of current trends and future directions in metadata management. We seek presentations that not only showcase technological advancements but also discuss the practical challenges and lessons learned in implementing these innovations. By bringing together researchers, data managers, and technologists, this session will foster a rich exchange of ideas on how new developments in metadata can lead to more effective and insightful data management practices.

Keywords: Metadata Management, Interoperability, FAIR Principles, Data Automation, Metadata Standards

Papers

Metadata-Driven Production of Longitudinal Datasets: Introduction and Potential of the Metadata Attribute

Mrs Claudia Saalbach (DIW Berlin/SOEP)
Mrs Jana Nebelin (DIW Berlin/SOEP) - Presenting Author

The production of longitudinal survey datasets poses significant technical and
substantive challenges for data producers. In particular, the integration, analysis, and
documentation of large volumes of data require sophisticated processes and
methodological expertise. From questionnaire development to dataset definition and
the final generation of longitudinal datasets, all steps are increasingly driven by
metadata. At the same time, it is essential to ensure that the extensive data offerings
are user-friendly and efficiently accessible.
At the SOEP (Socio-Economic Panel), approximately 23 survey instruments are
deployed annually, yielding around 41 raw survey datasets and roughly 15
longitudinal datasets, covering nearly 15,000 longitudinal variables from 1984 to the
present. The SOEP’s metadata system plays a pivotal role in this process, offering
various levels of granularity – from study, questionnaire, dataset, and variables to
topics and concepts. A new metadata attribute, the "module," extends this system by
positioning itself conceptually between topics and concepts and structurally between
datasets and variables.
The introduction of the "module" attribute allows for greater flexibility and efficiency in
the generation of longitudinal data products. Instead of processing large datasets
globally, targeted groups of variables – thematically cohesive and methodologically
grounded – can be handled separately. This approach not only enhances the
efficiency and speed of data production but also contributes to improving data
quality.
In our presentation, we provide insights into the practical implementation of
metadata-driven production at SOEP and demonstrate the potential of the "module"
metadata attribute for both data producers and users. Against the backdrop of
related concepts, such as "topic" or "concept" (e.g., at GESIS), we also propose an
initial definition of the term "module" and discuss its added value in the context of
longitudinal data production.
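
To illustrate the idea of processing thematically cohesive groups of variables instead of whole datasets, here is a minimal sketch in Python. It is not the SOEP implementation: the variable names, the module labels, and the helpers group_by_module and build_module are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical variable-level metadata; names and module labels are
# illustrative only and are not taken from the SOEP metadata system.
VARIABLES = [
    {"name": "inc_gross", "module": "income"},
    {"name": "inc_net", "module": "income"},
    {"name": "sat_life", "module": "satisfaction"},
    {"name": "sat_work", "module": "satisfaction"},
]

# Hypothetical person-year records in long format.
RECORDS = [
    {"pid": 1, "year": 2020, "inc_gross": 3000, "inc_net": 2100,
     "sat_life": 7, "sat_work": 6},
    {"pid": 1, "year": 2021, "inc_gross": 3100, "inc_net": 2150,
     "sat_life": 8, "sat_work": 6},
]


def group_by_module(variables):
    """Return a mapping from module label to the variable names it contains."""
    modules = defaultdict(list)
    for var in variables:
        modules[var["module"]].append(var["name"])
    return dict(modules)


def build_module(records, variable_names):
    """Keep only the identifier columns and the variables of one module."""
    keep = ["pid", "year", *variable_names]
    return [{key: rec[key] for key in keep if key in rec} for rec in records]


modules = group_by_module(VARIABLES)
# Only the "income" module is processed; other variables are left untouched.
income_long = build_module(RECORDS, modules["income"])
```

Because each module is built from its own variable list, a change to one module would only require rebuilding that module under these assumptions.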


Recent Developments in and Upcoming Endeavors to the NEPS Survey Life Cycle

Mr Daniel Bela (LIfBi / NEPS) - Presenting Author
Mr Simon Dickopf (LIfBi / NEPS)

Whilst long seen primarily as a benefit to survey documentation, machine-actionable survey metadata also offer possibilities to implement more efficient and less error-prone survey management procedures, such as preparatory testing for and conducting the survey fieldwork. We want to showcase the recent developments undertaken for the German National Educational Panel Study (NEPS) that make use of these opportunities. The NEPS started its newly recruited Starting Cohort 8, a panel sample of 5th graders, in 2022. For this new cohort study, NEPS implemented a whole new survey infrastructure based on freely available software components. The project’s long-established centralized and structured metadata storage, housed at the Leibniz Institute for Educational Trajectories (LIfBi), served as a backbone to re-imagine the panel’s life cycle with modern workflows.

We developed automated procedures for preparing, generating, and processing survey instruments based on their reference metadata. This eliminated the need for (manually produced) programming templates as well as the manual programming of the instruments themselves. Instead, the survey environment is automatically deployed to the field hardware via containerized software images. By putting the surveys’ metadata at the center of the infrastructure, we were able to accelerate survey creation and extend its testing. We also ensure the coherence of the stored metadata with the instruments’ contents and, ultimately, the disseminated data products and documentation.

Future developments aim to extend these workflows beyond the scope of our in-house software environments, so that NEPS’ metadata can be interchanged e.g. with contracted field institutes without manual interaction. Eventually, we aim to document the tools we created publicly, so that other survey infrastructures may build upon them.
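
As a rough illustration of generating an instrument from reference metadata and checking its coherence with that metadata, consider the following Python sketch. It does not reflect the NEPS tooling: the question records, the plain-text output format, and the functions render_instrument and instrument_ids are assumptions made for this example.

```python
# Hypothetical reference metadata for two questions; the field names and
# the plain-text instrument format are assumptions for illustration.
QUESTIONS = [
    {"id": "q1", "text": "How old are you?", "type": "integer",
     "min": 9, "max": 13},
    {"id": "q2", "text": "Do you like school?", "type": "choice",
     "options": ["yes", "no"]},
]


def render_instrument(questions):
    """Render one instrument line per question directly from the metadata."""
    lines = []
    for q in questions:
        if q["type"] == "integer":
            spec = f"integer {q['min']}..{q['max']}"
        elif q["type"] == "choice":
            spec = "choice " + "|".join(q["options"])
        else:
            raise ValueError(f"unsupported question type: {q['type']}")
        lines.append(f"{q['id']}: {q['text']} [{spec}]")
    return "\n".join(lines)


def instrument_ids(instrument_text):
    """Read the question identifiers back from a rendered instrument."""
    return [line.split(":", 1)[0] for line in instrument_text.splitlines()]


instrument = render_instrument(QUESTIONS)
# Coherence check: the instrument contains exactly the documented questions,
# in the documented order.
is_coherent = instrument_ids(instrument) == [q["id"] for q in QUESTIONS]
```

Since the instrument is derived from the metadata rather than programmed by hand, a check like is_coherent can run automatically whenever the metadata change.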


The ODISSEI Portal: A Metadata-Only Repository To Enhance Data Reuse in the Dutch Social Sciences

Mr Lucas van der Meer (Erasmus University Rotterdam/ODISSEI)
Dr Kasia Karpinska (Erasmus University Rotterdam/ODISSEI) - Presenting Author
Dr Angelica Maria Maineri (Erasmus University Rotterdam/ODISSEI)
Dr Tom Emery (Erasmus University Rotterdam/ODISSEI)

Despite a growing amount of data that is available for reuse in social research, data discovery is hindered by the multiplicity of registries and data sharing platforms and, consequently, by the variety in standards and terminologies to describe the data and the access conditions. Moreover, this heterogeneity limits opportunities for data linkage, which are often invisible to the users. To solve this fragmentation, the Dutch data infrastructure for the social sciences (ODISSEI) has launched a metadata-only Portal which unlocks access to several data collections.

The ODISSEI Portal combines metadata from a wide variety of research data repositories in the Netherlands into a single interface, allows advanced semantic queries to support findability, and facilitates data access. The ODISSEI Portal is a Dataverse interface which collects metadata from different providers, including Statistics Netherlands (CBS), the Data Archiving and Networked Services (DANS), and the LISS data archive. Metadata from the various providers is harvested (via endpoint or file dumps), harmonised to a common metadata schema, and enriched with multilingual thesauri to support a multilingual search. The enriched metadata is also exported to a knowledge graph, available via an external triple store, to enable complex queries. Moreover, a Data Access Broker (DAB) allows users not only to access open datasets, but also to request access to restricted access data from different providers, all from the Portal interface. To power the DAB, extensive work is being done to harmonise data access conditions and licenses and the way they are expressed in the metadata.

Future plans for the ODISSEI Portal include improving the functionalities of the DAB and allowing connections to Trusted Research Environments for accessing the underlying data and exchanging metadata with the CESSDA catalogue.

The ODISSEI Portal demonstrates how metadata can be leveraged to increase the FAIRness of research data.
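
The harvest, harmonise, and enrich steps described above can be sketched roughly as follows in Python. This is not the ODISSEI pipeline: the provider names, source field names, target schema, and the toy thesaurus are all hypothetical.

```python
# Hypothetical per-provider field mappings onto a common schema.
FIELD_MAPS = {
    "provider_a": {"titel": "title", "trefwoorden": "keywords",
                   "toegang": "access"},
    "provider_b": {"name": "title", "subjects": "keywords",
                   "rights": "access"},
}

# Toy thesaurus mapping Dutch terms to English labels (illustrative only).
THESAURUS = {"inkomen": "income", "gezondheid": "health"}

# Hypothetical harvested records, tagged with the provider they came from.
HARVESTED = [
    ("provider_a", {"titel": "Inkomensonderzoek",
                    "trefwoorden": ["inkomen"], "toegang": "restricted"}),
    ("provider_b", {"name": "Health Panel", "subjects": ["health"],
                    "rights": "open", "internal_id": 42}),
]


def harmonise(record, provider):
    """Rename provider-specific fields to the common schema; drop the rest."""
    mapping = FIELD_MAPS[provider]
    out = {"provider": provider}
    for source_field, value in record.items():
        target = mapping.get(source_field)
        if target is not None:
            out[target] = value
    return out


def enrich(record):
    """Add English keyword labels; terms not in the thesaurus are kept as-is."""
    keywords = record.get("keywords", [])
    return {**record, "keywords_en": [THESAURUS.get(k, k) for k in keywords]}


catalogue = [enrich(harmonise(rec, prov)) for prov, rec in HARVESTED]
```

In this sketch a search for "income" on the keywords_en field would also find the record that was originally described only in Dutch.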


What should FAIR Question Banks look like and how do we get there?

Mr Jon Johnson (CLOSER, Social Research Institute, UCL) - Presenting Author
Dr Suparna De (Department of Computer Science, University of Surrey)
Dr Wing Yan Li (Department of Computer Science, University of Surrey)
Dr Chandresh Pravin (Department of Computer Science, University of Surrey)
Mr Paul Bradshaw (Scottish Centre for Social Research (ScotCen))

The development of the CESSDA European Question Bank (https://eqb.cessda.eu/) opens up the possibility of making the core tool of survey research, “the question”, a referenceable and reusable object for the survey community. For decades, whilst questions have been extensively developed and tested, they have mostly been available within PDFs as adjuncts to the available data.

This has two main consequences: comparison between questions (especially across populations and studies) is onerous and time-consuming, and the unavailability of the questions for reuse militates against provenance and reproducibility and against the development of questionnaire tooling that could make use of them.

The presentation will discuss the challenges of capturing questions and questionnaires and providing them as FAIR objects, and what such objects need to contain so that they can be reused, drawing on 12 years of the CLOSER Discovery project in the UK.

The presentation will also cover recent advances in the extraction of questions from PDFs into DDI-Lifecycle for interoperability, and the limitations and possibilities that machine learning technologies offer for comparing questions from diverse sources.


CARING: Enhancing Open Data Quality through Community Engagement

Mr Christopher Klamm (University of Cologne)
Mr Ruben Bach (University of Mannheim) - Presenting Author
Mr Tornike Tsereteli (University of Mannheim)

Have you ever found an error in a dataset? Perhaps a misclassified sample or missing metadata about the annotation process? Have you ever wondered how you can help others benefit from your discovery of an error in a dataset, or of new information that the dataset needs? We are proposing a transparent platform that allows anyone to update datasets, transforming them from static to dynamic resources. This prototype will enhance the sharing and quality assurance of open datasets, addressing challenges posed by evolving and incomplete data. While “sharing” open-source datasets is trending, ensuring their quality is challenging because millions of data points would have to be validated manually. We propose a collaborative data quality evaluation concept based on “sharing and caring”. Users can add new samples, metadata, comments, or annotations, fostering continuous improvement and community engagement. Our project lays the groundwork for a platform that encourages contribution and participation from all users, integrating the concept of perspectivism. Recognizing that annotations vary due to annotator perspectives, we will gather diverse labels and metadata to reduce bias and create nuanced datasets, leading to more robust and fair models. This will aid social science research, enabling accurate conclusions that benefit researchers and the community. We will integrate an open-source annotation tool, allowing everyone to enhance datasets by correcting errors and adding new information. This will benefit all researchers by improving dataset quality. We hope to promote a new mindset regarding open data and its quality. Our updates will connect to the original dataset changes. A demonstrator platform will be developed to showcase the “sharing and caring” concept. This web application will enable user interaction with data, encouraging contributions like descriptions, references, and analysis code.
Additionally, it will incorporate data versioning with distinct reference IDs for major changes, promoting reproducibility and consistent citation for researchers.


Into the Metadataverse: Metadata-based Survey Data Management

Mrs Lisa Ziemba (Statistics Austria) - Presenting Author

Using metadata to automate data processing and aid data validation is an important new development in survey data management, because it allows for documentation and automation of many repetitive tasks during the data lifecycle. However, metadata management is a technical and strenuous task, as it requires meticulous record keeping in a standardized way. Panel surveys in particular face the challenge of managing metadata that change over the years. So how do we avoid getting lost in this vast metadataverse and instead use the myriad of information as our guide, change management system, documentation tool, and overall encyclopedia to the survey data?
As a first step towards this goal, we developed a data processing workflow in R based on a single key-value metadata table, which is updated throughout the processing and validation workflow, for use in the Austrian Socio-Economic Panel (ASEP). This workflow will serve as a centralized, metadata-based survey data management system that can be used for many waves to come, while keeping track of all changes and remaining adaptable to existing standards such as DDI or SDMX.
In the presentation we will show how the survey lifecycle can be made more efficient and transparent by implementing a metadata concept that consists of a simple and maintainable basis, which is easily scalable, adaptable, and expandable. We will discuss the potential outputs, such as a fully automated data documentation and a metadata enriched survey data file for the users.
By sharing our implementation processes in a newly established panel survey, we contribute to advancing the use of metadata in the social sciences for data documentation and management.
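
The idea of a single key-value metadata table that both drives validation and is updated by it can be sketched as follows. The abstract describes a workflow in R; this sketch uses Python purely for illustration, and the table layout, the keys "min", "max", and "n_invalid", and the helper functions are assumptions rather than the ASEP implementation.

```python
# Hypothetical key-value metadata table in long format: one row per
# (variable, key) pair. All values are stored as strings.
METADATA = [
    {"variable": "age", "key": "label", "value": "Age in years"},
    {"variable": "age", "key": "min", "value": "16"},
    {"variable": "age", "key": "max", "value": "110"},
]

# Hypothetical survey data for one variable.
DATA = {"age": [34, 15, 52, 130]}


def get_value(table, variable, key):
    """Look up the value stored for a (variable, key) pair, or None."""
    for row in table:
        if row["variable"] == variable and row["key"] == key:
            return row["value"]
    return None


def set_value(table, variable, key, value):
    """Update an existing (variable, key) row or append a new one."""
    for row in table:
        if row["variable"] == variable and row["key"] == key:
            row["value"] = str(value)
            return
    table.append({"variable": variable, "key": key, "value": str(value)})


def validate(table, variable, values):
    """Check values against the documented range and record the result."""
    low = int(get_value(table, variable, "min"))
    high = int(get_value(table, variable, "max"))
    n_invalid = sum(1 for v in values if not low <= v <= high)
    # The validation outcome is written back into the same metadata table.
    set_value(table, variable, "n_invalid", n_invalid)
    return n_invalid


n_bad = validate(METADATA, "age", DATA["age"])
```

After the call, the table holds the documented range and the validation result side by side, which is the kind of record a generated documentation step could draw on.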