
ESRA 2025 Preliminary Program

              



All time references are in CEST

Enhancing Survey Operations through Software Solutions: from questionnaire design to data curation

Session Organisers: Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, UCL Social Research Institute)
Time: Friday 18 July, 09:00 - 10:30
Room: Ruppert D - 0.22

Survey operations increasingly rely on the use and/or development of software or code solutions across the survey lifecycle. These solutions support the survey from questionnaire design, through survey operations during fieldwork, to data curation, metadata generation and sharing.
In this session we invite colleagues to share and discuss their experience and ideas on designing, developing and using software solutions or code to support and enhance survey operations, including challenges experienced, limitations, lessons learnt, and advice on better practices. These technical solutions may currently be in the design, development or testing phase, or planned for further enhancement.
We welcome presentations on software solutions and code that support survey operations, such as:
• Technical setup of questionnaire design, facilitating from the outset smooth post-survey data processing and metadata generation.
• Automated data curation processes, for example, technical protocols or bespoke scripts for processing and checking of survey data, paradata, and metadata, data quality assurance, validation techniques and programmatic identification of data inconsistencies or errors (a minimal illustrative sketch follows below).
• Implementation of ETL (extract, transform and load) streamlined data workflows.
• Use of databases to store and manage data and metadata, and public sharing of data management syntax.
• Automated generation of documentation and versioning of data products.
Finally, we invite presentations of solutions written in a variety of languages, such as statistical software (e.g. SPSS, Stata, SAS), scripting languages (e.g. R, Python), or any other language/framework (e.g. .NET), as well as solutions for long-term storage in databases (e.g. MS SQL Server, PostgreSQL, MySQL, MongoDB).
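
As a minimal illustration of the kind of programmatic consistency check mentioned above, the Python/pandas sketch below flags logically inconsistent cases; the variable names and the rule itself are hypothetical and not drawn from any particular study.

import pandas as pd

# Hypothetical survey extract: respondent age and economic activity status.
df = pd.DataFrame({
    "serial": [1001, 1002, 1003],
    "age": [34, 12, 57],
    "econ_act": ["employed", "employed", "retired"],
})

# Rule-based consistency check: flag respondents recorded as employed
# while younger than a plausible working age (threshold is illustrative).
inconsistent = df[(df["econ_act"] == "employed") & (df["age"] < 16)]

# Write flagged cases to a report for manual review rather than editing the data.
inconsistent.to_csv("consistency_flags.csv", index=False)
print(f"{len(inconsistent)} inconsistent case(s) flagged for review")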

Keywords: survey life cycle, operations, software, code

Papers

Developing a Scalable and Interoperable Data Processing System for the RKI: A Comprehensive Approach

Mr Tobias Heller (Robert Koch Institute) - Presenting Author
Mrs Sabine Born (Robert Koch Institute)

Background
For more than two decades, the data processing procedures of the Epidemiological Data Centre of the Robert Koch Institute have evolved into a versatile and powerful system for managing and processing different kinds of data (e.g. survey data, measurement data, laboratory data). The system utilizes various technologies in a complex data workflow whose sub-processes require a lot of manual intervention. Unfortunately, there are many media breaks in the established system, and it is neither scalable nor interoperable. The number and frequency of new data collections within the panel "Gesundheit in Deutschland" are increasing significantly, so a new system has to be designed, set up and operated to meet the new requirements of scalability and interoperability.
Methods
Within this project, a new overall concept for processing the research data was developed in which, among other things, international standards like FHIR are considered from the very beginning. Various types of research data can be received via a web API, formally validated and written to a raw database. An ETL tool then has access to the raw database and processes the data in a pipeline following defined rules, which are stored on a code server. In this process every transformation step is logged and can be reviewed. At the end of the process, a transformed data set is created in which various processing steps (including several validity checks, duplicate recognition and deletion, variable generation) have been performed. This data set is then made available for further internal processing via API and download.
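
As a hedged sketch of what one logged, rule-driven transformation step in such a pipeline could look like (the rule format, function names and log target are assumptions for illustration, not the RKI implementation):

import logging
import pandas as pd

logging.basicConfig(filename="etl_steps.log", level=logging.INFO)

def apply_rule(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """Apply one transformation rule and log the step so it can be reviewed."""
    before = len(df)
    if rule["type"] == "drop_duplicates":
        df = df.drop_duplicates(subset=rule["keys"])
    elif rule["type"] == "derive":
        df[rule["target"]] = df.eval(rule["expression"])
    logging.info("rule=%s rows_before=%d rows_after=%d",
                 rule["name"], before, len(df))
    return df

# Illustrative rule set; in the described system the rules live on a code server.
rules = [
    {"name": "dedup_id", "type": "drop_duplicates", "keys": ["participant_id"]},
    {"name": "bmi", "type": "derive", "target": "bmi",
     "expression": "weight_kg / (height_m ** 2)"},
]

raw = pd.DataFrame({"participant_id": [1, 1, 2],
                    "weight_kg": [70.0, 70.0, 82.0],
                    "height_m": [1.75, 1.75, 1.80]})
for rule in rules:
    raw = apply_rule(raw, rule)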
Results and added Value
The project has been in planning for more than a year and an MVP was finalized in December 2024. New functions will be added in further development stages in the upcoming months. In the presentation, the individual components and their functions will be introduced.


Automating, Cataloguing, and Versioning of Shared Data: A Case Study from Four Longitudinal Studies

Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, UCL Social Research Institute) - Presenting Author
Mr Liam Curran (Centre for Longitudinal Studies, UCL Social Research Institute)
Mrs Maggie Hancock (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Aida Sanchez (Centre for Longitudinal Studies, UCL Social Research Institute, United Kingdom)
Ms Nureen Hanisah Mohammad Zaki (Centre for Longitudinal Studies, UCL Social Research Institute, United Kingdom)

We introduce a relational database and technical pipelines implemented to manage the research datasets that four UK longitudinal cohort studies have shared over many years. The primary objective of this system is to establish an efficient way of cataloguing, storing and versioning these shared datasets and, where applicable, their different versions. It provides a comprehensive overview of a large pool of data and assists with tracking each dataset's sharing history. It also facilitates the comparison of a dataset's versions to ensure continuity and consistency.
At the UCL Centre for Longitudinal Studies we have designed and built this data sharing system using Python and a PostgreSQL database. The presentation will cover the design, the archiving mechanism and the main pipelines involved, and will also give an overview of the development journey and lessons learnt. The technical pipelines automate various tasks, such as the ETL process of extracting (E) the data from shared SPSS datasets, transforming (T) them into components, and then loading (L) these into the database. Other core pipeline tasks include cataloguing key information about the dataset during import, archiving previous versions of the datasets, creating comparison reports against previous versions, and providing export tools to produce customised research datasets for sharing.
In Python, we primarily use pandas for managing dataframes, sqlalchemy for connecting to and interacting with the database, and savReaderWriter for reading and writing SPSS files.
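
A minimal sketch of the extract-transform-load step under these assumptions (the file path, connection string and table name are placeholders, not the production configuration):

import pandas as pd
from savReaderWriter import SavReader
from sqlalchemy import create_engine

# Placeholder connection string and paths, for illustration only.
engine = create_engine("postgresql+psycopg2://user:password@localhost/datashare")
sav_path = "shared_dataset_v2.sav"

# Extract: with returnHeader=True the first row holds the variable names,
# the remaining rows hold the cases.
with SavReader(sav_path, returnHeader=True) as reader:
    rows = list(reader)
variables, cases = rows[0], rows[1:]

# Transform: build a dataframe component (decoding byte strings where needed).
df = pd.DataFrame(cases, columns=[v.decode() if isinstance(v, bytes) else v
                                  for v in variables])

# Load: append the component to the catalogue database.
df.to_sql("dataset_components", engine, if_exists="append", index=False)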


Improving questionnaire development of mixed-mode, cross-national surveys

Dr Marika de Bruijne (Centerdata) - Presenting Author
Mr Sebastiaan Pennings (Centerdata)

Developing mixed-mode surveys that result in comparable data between different modes is one of the major challenges for survey researchers. In addition, cross-national surveys face the demand to produce comparable data between different languages.
The prevailing strategy for developing mixed-mode surveys is the unified mode design, according to which questions in different modes should be kept as identical as possible. We have developed a mixed-mode survey infrastructure that supports this strategy for cross-national surveys.
The process of questionnaire development starts with the master questionnaire, which serves as the source code for all different modes and languages. This master questionnaire allows for small mode variations in texts where needed. The master questionnaire is translated in our translation management tool and applied for national implementations in different modes. While the texts for all modes are included in the questionnaire, only the texts that are applicable to a specific mode are enabled in the final implementation.
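
As an illustrative sketch of this unified-mode idea (a hypothetical item structure, not the DataCTRL format), a master item can carry a shared default text, small per-mode variations and translations, with only the applicable text enabled in each implementation:

# Hypothetical master-questionnaire item: a shared default wording plus
# optional per-mode overrides and per-language texts.
master_item = {
    "name": "q_health",
    "text": {
        "default": "How is your health in general?",
        "cati": "I will now read a question about your health. "
                "How is your health in general?",
    },
    "translations": {
        "nl": {"default": "Hoe is uw gezondheid in het algemeen?"},
    },
}

def render(item, mode, language="en"):
    """Pick the text enabled for one mode and language, falling back to the default."""
    texts = item["translations"].get(language, item["text"])
    return texts.get(mode, texts.get("default", item["text"]["default"]))

print(render(master_item, mode="cati"))                 # interviewer-administered wording
print(render(master_item, mode="cawi", language="nl"))  # web wording in Dutch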
This design helps to manage the questionnaire development process and enables proper versioning and quality assurance. It reduces the risk of unintended differences arising later in the process and during iterations of questionnaire amendments. Moreover, the dataset structure remains identical in all survey implementations, so the resulting survey data can easily be merged into one large dataset. However, the approach leaves less room for last-minute adaptations and local variations.
We will present our software solution for implementing mixed-mode, cross-national questionnaires. We will describe the process from the questionnaire development phase up to the production of the merged dataset. This mixed-mode survey infrastructure, our DataCTRL survey tool suite, is used in large European surveys such as ESS, SHARE, Guide, and EVS.


CAPI Object-Oriented Framework

Mr Glen Heller (ICF)
Mr Alexander Izmukhambetov (Data Experts Consulting International) - Presenting Author
Ms Lindsey Anna (United States Agency for International Development)
Ms Monica Kothari (ICF)

As part of its R&D initiative, the Surveys for Monitoring in Resilience and Food Security (SMRFS) has developed a library of generalized and reusable software components, standards, and services that serve as the main platform for implementing digital survey data collection systems. Central to this framework is a technology introduced in CSPro version 7.7, which acts as an interface between native CSPro application code and a front-end web view executing HTML/JavaScript. This interface enables asynchronous invocations to the CSPro engine from static or dynamically generated web pages, thereby exposing the full functionality of CSPro to JavaScript programs. This allows developers to leverage the object-oriented nature of the JavaScript language within the context of a procedural CSPro back-end and native hierarchical data models.

The SMRFS CAPI framework encapsulates this interface functionality into a library of rich and versatile visual components, subsystems, and services. These are designed to augment native CSPro functionality and provide clear, rational code abstraction for CSPro client-side application developers. The framework aims to shift the paradigm of CSPro application development from a purely procedural model to a more object-oriented and event-driven one, defining and standardizing programming patterns and conventions tailored for rapid development of distributed survey data collection systems.

The framework is designed for programmers building complex data collection and management systems. If CSPro is the basic tool, the framework is the upgrade, enabling developers to easily implement complex features and rich user interfaces. This reduces system development time, improves quality and user experience, and creates a robust, standardized approach to application building within the CSPro environment. The SMRFS CAPI applications take advantage of these enhancements to develop tools and services that allow for modularization and standardization within the SMRFS CAPI system, reducing the number of survey-specific modifications.


Applying NLP for Efficient and Accurate Survey Data Curation

Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, UCL Social Research Institute) - Presenting Author
Mrs Sarah Kerry-Barnard (Centre for Longitudinal Studies, UCL Social Research Institute)
Ms Silvia Mendonca (Centre for Longitudinal Studies, UCL Social Research Institute)

We present a selection of data quality checks used in survey data curation that utilise Natural Language Processing (NLP). At the UCL Centre for Longitudinal Studies, we carry out data curation from the start of survey data collection up to the stage of sharing the data with researchers. A variety of checks are employed to ensure the quality and consistency of the survey data. For some of these data quality checks, especially when textual data are involved, NLP tools are employed. The objectives of using NLP are to automate and standardise the checks on textual survey data and to reduce processing time. We will present examples that focus on the longitudinal continuity of participants and on the examination of potentially sensitive data to ensure safe data sharing. We will also cover additional, smaller-scale examples, such as metadata quality and publication, the review of back-coded responses, and various challenges in handling textual data. We use a variety of tools from Python and NLP resources such as NLTK and spaCy.
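
As a minimal, hedged illustration of one such check (not the production pipeline), spaCy's named-entity recogniser can flag person names in free-text answers before data sharing:

import spacy

# Small English pipeline; the production model and rules may differ.
# (Requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

responses = {
    1001: "My GP, Dr Smith, referred me to the clinic.",
    1002: "I changed jobs twice during the pandemic.",
}

# Flag responses containing PERSON entities for manual disclosure review.
flagged = {}
for serial, text in responses.items():
    doc = nlp(text)
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    if names:
        flagged[serial] = names

print(flagged)  # e.g. {1001: ['Smith']}, depending on the model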