ESRA 2025 Preliminary Program
All time references are in CEST
Sharing routes for individual-level research data and code 2
Session Organisers
Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, UCL Social Research Institute)
Ms Cristina Magder (UK Data Service, UK Data Archive, University of Essex)
Time: Wednesday 16 July, 09:00 - 10:30
Room: Ruppert 011
Sharing individual-level research data is a core activity of many research projects, which often collect and manage a variety of data types, such as survey data, biomarkers, complex sensor data, genomics, linked administrative data, and geographical data. Some researchers are also generating synthetic data for teaching purposes and preliminary development of analysis code. Each of these data types presents its own challenges when it comes to sharing and dissemination for future research purposes. Additionally, the sharing of programming code is fundamental in ensuring reproducibility and transparency.
Data releases are managed internally by the studies themselves and/or externally by national archives or Trusted Research Environments. Balancing the wide sharing of detailed research data with the need to maintain confidentiality and security, while also ensuring easy and swift access without significant barriers or delays, is a complex challenge. The balance becomes even more challenging when dealing with sensitive and/or potentially disclosive data. Data that fall under the GDPR definition of “special category data” require additional protection and a higher degree of security and governance, often involving Data Access Committee oversight and dedicated legal and sharing frameworks.
The aim of this session is to provide a platform for colleagues to discuss their experience and approaches to sharing individual-level research data, sensitive and non-sensitive, original or synthetic. Participants are encouraged to share their techniques to assess and manage disclosure risk, and best practices and challenges of code sharing. We invite colleagues to submit ideas relating to, but not restricted to:
- Sharing routes for individual-level research data
- Publication of programming code or syntax
- Management and sharing of synthetic data
- Methods of risk assessment of disclosivity and sensitivity
- Research data classification or data tiers
- Technical tools used to generate bespoke datasets
- Data access via Trusted Research Environments
- International data sharing
Keywords: data, sharing, disclosure, sensitive, code, synthetic
Papers
Balancing Privacy and Data Utility: Comparison of Convex Combination of Group Responses and Randomized Regrouped Responses in Statistical Disclosure Control
Professor Shu-Mei Wan (Lunghwa University of Science and Technology)
Professor Chien-Hua Wu (Chung-Yuan Christian University) - Presenting Author
The proposed techniques offer a well-balanced solution for statistical disclosure control (SDC), ensuring the data remains both useful and confidential.
Two statistical techniques are evaluated: Randomized Regrouped Responses (RRR) and Convex Combination of Group Responses (CCGR). These techniques employ transition probabilities to manage group assignments and response combinations, respectively. RRR reassigns subjects across groups using a randomization mechanism, and conventional statistical inference methods can be applied to account for randomization effects. CCGR combines responses from multiple groups into a single response through a weighted average, where the weights are determined by transition probabilities.
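As a rough illustration of the two mechanisms described above, the following is a minimal two-group sketch in Python. The function names, the two-group simplification, and the exact form of the randomization are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def rrr(labels, p_retain, rng):
    """Randomized Regrouped Responses (illustrative sketch): each subject
    keeps its group label with probability p_retain; otherwise it is
    reassigned to the other group (two-group case, labels in {0, 1})."""
    flip = rng.random(len(labels)) >= p_retain
    return np.where(flip, 1 - labels, labels)

def ccgr(y, labels, p, rng):
    """Convex Combination of Group Responses (illustrative sketch): each
    released response is a weighted average of the subject's own response
    (weight p) and a randomly drawn response from the other group."""
    out = np.empty_like(y, dtype=float)
    for g in (0, 1):
        own = y[labels == g]
        other = y[labels != g]
        draw = rng.choice(other, size=own.size)
        out[labels == g] = p * own + (1 - p) * draw
    return out
```

With p close to 1 the released data are nearly the original responses (high utility, higher disclosure risk); as p approaches 1/2 the group distributions blend together, which matches the abstract's observation that statistical power diminishes near that point.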
For instance, as the transition probability approaches 1/2 in a two-group study, distributions between groups become more similar, reducing statistical power. Simulation studies show that when retention probabilities approach 1/2 for two groups, the power of the released data diminishes, though adjusted powers remain stable. Considering the disclosure risk, lower retention probabilities are recommended when using the adjusted test statistic.
CCGR effectively balances data utility and confidentiality in statistical analyses, especially in areas where privacy is critically important.
How-To Guidance for Setting Up a TRE and Achieving Interoperability in a Multinational Federation of TREs: EOSC-ENTRUST Driver 2
Dr Deb Wiltshire (GESIS-Leibniz Institute for the Social Sciences) - Presenting Author
Ms Beate Lichtwardt (UK Data Service)
Dr Sharon Bolton (UK Data Service)
Mr Peter Hegedus (TARKI)
The EOSC-ENTRUST project aims to create a ‘European Network of TRUSTed research environments’ for sensitive data and to drive interoperability by developing a common blueprint for federated data access and analysis. Data governance, legislation and management all differ between its international partner institutions, and reconciling these differences to enable the wider sharing of sensitive data is at the heart of the project. EOSC-ENTRUST Driver 2 represents one of four use cases, which are prototypic for federated, multinational use of TREs in research practice across scientific domains and user communities. Our presentation will outline Driver 2’s use case. Testing and building on a framework established under the SSHOC project for implementing trans-national data sharing agreements, enabling and setting up new remote access connections is one aspect, but it is not sufficient on its own, as it assumes a level of existing maturity of a TRE. What is additionally needed is guidance for organisations looking to set up a new TRE from scratch. Our talk will address the latter.
Reproducible research for social scientists: interactive notebooks, version control repositories and digital object identifiers
Dr Jools Kasmire (UK Data Service / University of Manchester) - Presenting Author
In school, we are all told to show our work. But as researchers and academics, how often do we really show all of our work? There are very good reasons why the ‘show your work’ practices that were easy with math homework are much harder to apply to years-long projects using survey data on sensitive topics! Still, if important research on topics with social value is to be trusted or used to inform policy, then the research must be sufficiently transparent, comprehensible and reproducible.
Trustworthy research requires overcoming the challenges that make it hard to document all the steps taken and decisions made during the research project as well as all the ways that the data was processed from original or raw forms through to final analysis and visualisations. The challenges behind documenting research transparently are not trivial or simplistic and may be especially difficult for those working in social sciences, working with qualitative data, or working with data that contains personal or private information. Despite the multiple, complex and interacting factors that make documenting research difficult, there are many more tools available now that can reduce, ameliorate or sometimes even eliminate the challenges so that researchers can show their work properly. This presentation introduces a few key tools that researchers can incorporate into their work practices that will make it easy for others to understand, inspect, cite, and further develop their research. Specifically, this presentation covers how interactive notebooks, version control repositories and digital object identifiers can enhance the transparency, reproducibility and citability of research with special attention paid to how these tools play out for social scientists and others who work with qualitative, secure, restricted or otherwise controlled data.
Enhancing Code Discovery: The ODISSEI Code Library
Dr Angelica Maria Maineri (Erasmus University Rotterdam/ODISSEI) - Presenting Author
Sharing analytical code as a part of scientific publication enhances trust by allowing scrutiny and replication of results, promoting collaborative learning. This holds true even when code runs on sensitive data that cannot be distributed alongside the code. However, there are many barriers to code sharing, ranging from a lack of time to prepare it to concerns over research scooping (see Krähmer, Schächtele and Schneck, 2023). Moreover, the incentives for sharing well-documented code are often low, since pieces of code scattered across different registries (e.g., OSF, Zenodo, ResearchBox) are difficult to find and reuse.
At ODISSEI, the Dutch national infrastructure for the social sciences, we started building a library of analytical code used in studies based on the LISS panel data and the administrative microdata held by Statistics Netherlands (CBS), to make such code easier to retrieve and reuse (see https://odissei-data.github.io/ODISSEI-code-library/). For each project, next to the link to the code, we provide the link to the paper and, when available, a DOI directly pointing to the dataset used. The ODISSEI Code Library can be used to search for code on a given topic. Moreover, its content is harvested by the ODISSEI Knowledge Graph, which enables complex queries across the different ODISSEI facilities.
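To make the idea of such a catalog concrete, here is a minimal sketch of what a per-project record and a topic search might look like. All field names, project titles and URLs below are hypothetical illustrations, not the actual ODISSEI Code Library schema.

```python
# Hypothetical catalog records: each entry links code, paper and
# (when available) the dataset, mirroring the structure described above.
records = [
    {
        "project": "Example LISS panel study",   # hypothetical title
        "code_url": "https://example.org/code/liss-study",
        "paper_url": "https://example.org/paper/liss-study",
        "dataset_doi": "https://example.org/doi/liss-data",
        "topics": ["wellbeing", "income"],
    },
    {
        "project": "Example CBS microdata project",  # hypothetical title
        "code_url": "https://example.org/code/cbs-project",
        "paper_url": "https://example.org/paper/cbs-project",
        "dataset_doi": None,  # dataset DOI recorded only when available
        "topics": ["labour market"],
    },
]

def search_by_topic(records, keyword):
    """Return all records whose topic list mentions the keyword."""
    kw = keyword.lower()
    return [r for r in records if any(kw in t.lower() for t in r["topics"])]
```

Keeping records in a simple, machine-readable structure like this is also what makes harvesting by a knowledge graph straightforward.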
The presentation will focus on the setup of the ODISSEI Code Library and its contribution to open knowledge. Practical steps for building a similar catalog will be offered. Moreover, we will show how the code library helps promote good code-sharing practices across the whole ODISSEI community.
Revitalizing Older Data for Modern Research: Streamlining the Curation of Legacy Survey Data
Dr Sharon Bolton (UK Data Service) - Presenting Author
Long-standing data archives often house a treasure trove of older survey data covering a wide range of topics, but these valuable resources can be difficult for today's researchers to access without extensive support. While expert collection management can upgrade older datasets to modern formats and standards, the process is often time-consuming and complex. In an era of limited resources and growing data curation demands, the need for efficient tools is more pressing than ever. This presentation will showcase the data enhancement strategies we use at the UK Data Archive to breathe new life into legacy survey data. Using tools from the Data Curation Network, we are transforming these rich historical datasets into accessible resources that continue to fuel impactful research.