ESRA 2023 Glance Program


All time references are in CEST

Sharing routes for individual-level research data and code

Session Organisers Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, UCL Social Research Institute)
Ms Cristina Magder (UK Data Service, UK Data Archive, University of Essex)
Time Tuesday 18 July, 09:00 - 10:30
Room

Sharing individual-level research data is a core activity of many research projects, which often collect and manage a variety of data types, such as survey data, biomarkers, complex sensor data, genomics, linked administrative data, and geographical data. Some researchers are also generating synthetic data for teaching purposes and preliminary development of analysis code. Each of these data types presents its own challenges when it comes to sharing and dissemination for future research purposes. Additionally, the sharing of programming code is fundamental in ensuring reproducibility and transparency.
Data releases are managed internally by the studies themselves and/or externally by national archives or Trusted Research Environments. Balancing the wide sharing of detailed research data with the need to maintain confidentiality and security, while also ensuring easy and swift access without significant barriers or delays, is a complex challenge. This balance becomes even more challenging when dealing with sensitive and/or potentially disclosive data. Data that fall under the GDPR definition of “special category data” require additional protection and a higher degree of security and governance measures, often involving Data Access Committee oversight and dedicated legal and sharing frameworks.
The aim of this session is to provide a platform for colleagues to discuss their experience and approaches to sharing individual-level research data, sensitive and non-sensitive, original or synthetic. Participants are encouraged to share their techniques to assess and manage disclosure risk, and best practices and challenges of code sharing. We invite colleagues to submit ideas relating to, but not restricted to:
- Sharing routes for individual-level research data
- Publication of programming code or syntax
- Management and sharing of synthetic data
- Methods of risk assessment of disclosivity and sensitivity
- Research data classification or data tiers
- Technical tools used to generate bespoke datasets
- Data access via Trusted Research Environments
- International data sharing

Keywords: data, sharing, disclosure, sensitive, code, synthetic

Papers

Balancing Privacy and Data Utility: Comparison of Convex Combination of Group Responses and Randomized Regrouped Responses in Statistical Disclosure Control

Professor Shu-Mei Wan (Lunghwa University of Science and Technology)
Professor Chien-Hua Wu (Chung-Yuan Christian University) - Presenting Author

The proposed techniques offer a well-balanced solution for statistical disclosure control (SDC), ensuring the data remains both useful and confidential.
Two statistical techniques, RRR and CCGR, are evaluated. These techniques employ transition probabilities to manage group assignments and response combinations, respectively. RRR allows for the reassignment of subjects across groups using a randomization mechanism, and conventional statistical inference methods can be applied to account for randomization effects. CCGR combines responses from multiple groups into a single response through a weighted average, where the weights are determined by transition probabilities.
For instance, as the transition probability approaches 1/2 in a two-group study, distributions between groups become more similar, reducing statistical power. Simulation studies show that when retention probabilities approach 1/2 for two groups, the power of the released data diminishes, though adjusted powers remain stable. Considering the disclosure risk, lower retention probabilities are recommended when using the adjusted test statistic.
CCGR effectively balances data utility and confidentiality in statistical analyses, especially in areas where privacy is critically important.
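The attenuation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes a two-group study with equal group sizes in which each subject keeps its group label with retention probability p and is otherwise moved to the other group. Under those assumptions, the expected gap between the released group means is approximately (2p - 1) times the true gap, so it shrinks to zero as p approaches 1/2.

```python
import random


def randomized_regroup(labels, retention_p, rng):
    """Keep each two-group label (0 or 1) with probability retention_p, else flip it.

    Hypothetical illustration of a randomized regrouping step; the
    mechanism used in the paper may differ.
    """
    return [g if rng.random() < retention_p else 1 - g for g in labels]


def attenuated_gap(true_gap, retention_p):
    """Approximate expected gap between released group means.

    Assumes two large groups of equal size, so each released group is a
    (retention_p, 1 - retention_p) mixture of the two original groups.
    """
    return (2 * retention_p - 1) * true_gap


rng = random.Random(0)                      # fixed seed for repeatability
labels = [0] * 5000 + [1] * 5000            # two equal-sized groups
released = randomized_regroup(labels, 0.7, rng)

# Share of subjects whose released label matches the original one.
kept = sum(a == b for a, b in zip(labels, released)) / len(labels)
```

Here `kept` should be close to the chosen retention probability of 0.7, and `attenuated_gap(true_gap, 0.5)` is zero for any true gap, which matches the loss of power reported when retention probabilities approach 1/2.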


How-To Guidance for TREs to be, setup and being interoperable in a multinational federation of TREs. EOSC-ENTRUST Driver 2

Dr Deb Wiltshire (GESIS-Leibniz Institute for the Social Sciences) - Presenting Author
Ms Beate Lichtwardt (UK Data Service)
Dr Sharon Bolton (UK Data Service)
Mr Peter Hegedus (TARKI)

The EOSC-ENTRUST project aims to create a ‘European Network of TRUSTed research environments’ for sensitive data and to drive interoperability by developing a common blueprint for federated data access and analysis. Data governance, legislation and management all differ between its international partner institutions. How to reconcile these differences to enable the wider sharing of sensitive data is at the heart of this project. EOSC-ENTRUST Driver 2 represents one of four use cases, which are prototypic for federated, multinational use of TREs in research practice across scientific domains and user communities. Our presentation will outline Driver 2’s use case. Testing and building on a framework established under the SSHOC project for implementing trans-national data sharing agreements, and enabling and setting up new remote access connections, is one aspect, but it is not sufficient on its own, as it assumes a level of existing maturity of a TRE. What is additionally needed is guidance for organisations that are looking to set up a new TRE from scratch. Our talk will address the latter.


Reproducible research for social scientists: interactive notebooks, version control repositories and digital object identifiers

Dr Jools Kasmire (UK Data Service / University of Manchester) - Presenting Author

In school, we are all told to show our work. But as researchers and academics, how often do we really show all of our work? There are very good reasons why the ‘show your work’ practices that were easy with math homework are much harder to apply to years-long projects using survey data on sensitive topics! Still, if important research on topics with social value is to be trusted or used to inform policy, then the research must be sufficiently transparent, comprehensible and reproducible.
Trustworthy research requires overcoming the challenges that make it hard to document all the steps taken and decisions made during the research project as well as all the ways that the data was processed from original or raw forms through to final analysis and visualisations. The challenges behind documenting research transparently are not trivial or simplistic and may be especially difficult for those working in social sciences, working with qualitative data, or working with data that contains personal or private information. Despite the multiple, complex and interacting factors that make documenting research difficult, there are many more tools available now that can reduce, ameliorate or sometimes even eliminate the challenges so that researchers can show their work properly. This presentation introduces a few key tools that researchers can incorporate into their work practices that will make it easy for others to understand, inspect, cite, and further develop their research. Specifically, this presentation covers how interactive notebooks, version control repositories and digital object identifiers can enhance the transparency, reproducibility and citability of research with special attention paid to how these tools play out for social scientists and others who work with qualitative, secure, restricted or otherwise controlled data.


Sixty years of social science data sharing: legal and ethical frameworks in practice at the UK Data Archive

Mrs Susan Cadogan (UK Data Service, UK Data Archive, University of Essex) - Presenting Author

As the UK Data Archive, the lead partner of the UK Data Service, approaches its 60th anniversary, it stands as proof of sustained innovation in data sharing for the social sciences. Continuously funded by the Economic and Social Research Council (now part of UK Research and Innovation) since its inception, the Archive has pursued a clear mission: to build a collection of data valuable to researchers and to negotiate access to meet their needs. Central to this mission is balancing wider data access with long-term usability while addressing the practical, legal, and ethical challenges of data deposit.

The early years involved significant negotiation with data creators, who often feared early publication by others, critical scrutiny, or misinterpretation of their data. These challenges were counterbalanced by the benefits of archiving, preservation, managing access, ensuring citation, and, more recently, assigning DOIs, while offering support throughout the process.

At the centre of these efforts is a robust legal and ethical framework. This ensures depositors have the rights to deposit data, protects their rights, and upholds standards around privacy, consent, and responsible use. Negotiated licence agreements align depositor goals with user needs, ensuring data are as open as possible, with restrictions applied where necessary.

This presentation will review the evolution of our three-tier licence and access framework over the past 60 years and its adaptation to emerging challenges. We will examine the key role of legal and ethical standards in balancing openness with restrictions, and how these principles intersect with broader open access frameworks and repositories for the social sciences.


Enhancing Code Discovery: The ODISSEI Code Library

Dr Angelica Maria Maineri (Erasmus University Rotterdam/ODISSEI) - Presenting Author

Sharing analytical code as part of a scientific publication enhances trust by allowing scrutiny and replication of results, and promotes collaborative learning. This holds true even when code runs on sensitive data that cannot be distributed alongside the code. However, there are many barriers to code sharing, ranging from a lack of time to prepare it to concerns over research scooping (see Krähmer, Schächtele and Schneck, 2023). Moreover, the incentives for sharing well-documented code are often low, since pieces of code scattered across different registries (e.g., OSF, Zenodo, ResearchBox) are difficult to find and reuse.

At ODISSEI, the Dutch national infrastructure for the social sciences, we started building a library of analytical code used in studies using the LISS panel data and the administrative microdata held by Statistics Netherlands (CBS) to make them easier to retrieve and reuse (see https://odissei-data.github.io/ODISSEI-code-library/). For each project, next to the link to the code, we provide the link to the paper and, when available, a DOI directly pointing to the dataset used. The ODISSEI Code Library can be used for searching code on a given topic. Moreover, its content is harvested by the ODISSEI Knowledge Graph, which enables complex queries across the different ODISSEI facilities.

The presentation will focus on the setup of the ODISSEI Code Library and its contribution to open knowledge. Practical steps on how to build a similar catalog will be offered. Moreover, it will be shown how the code library helps promote good code-sharing practices across the whole ODISSEI community.


Longitudinal Data Sharing: Direct Release vs Controlled Access

Dr Aida Sanchez-Galvez (UCL Centre for Longitudinal Studies)
Ms Claudia Yogeswaran (UCL Centre for Longitudinal Studies) - Presenting Author

Striking the right balance between maximising research use of longitudinal survey data and minimising risks to participants' rights is a complex challenge. Research data sharing must ensure data are widely available to the international research community in a fair, open, and transparent manner, while guaranteeing: i) sensitive and/or disclosive data are shared securely; ii) compliance with legal, ethical, and moral responsibilities to participants; and iii) adherence to consent agreements.
The UCL Centre for Longitudinal Studies (CLS) manages several national longitudinal cohort studies, which follow the lives of tens of thousands of people in the UK. CLS facilitates two levels of data access, which represent fundamentally different approaches to data dissemination and control: 1) direct data release to users, for analysis in their institutional servers; and 2) remote access via Trusted Research Environments (TREs), which are secure servers operating under the Five Safes Framework. The sensitivity and disclosure risk of the longitudinal data determines the appropriate data sharing route.
Direct data release allows ease of access and is the primary CLS method for sharing individual-level survey data. This approach is only suitable for data that have undergone thorough assessment by the CLS data management team to ensure low sensitivity and minimal identification risk. This distribution is safeguarded, as it requires registration via the UK Data Service and an application process, with data usage governed by an End User Licence and/or the CLS Data Sharing Agreement. Re-identification of individuals is strictly forbidden.
Conversely, TREs are used to share highly sensitive data or data with significant disclosivity risk. Despite being highly restrictive and often seen as a barrier to agile research, this model has been gaining popularity over the last few years and has resulted in a proliferation of TREs in the UK and across European countries.


The NextGen Harmonised Data Gateway

Dr Rabia Karatoprak Ersen (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Dr Insa Bechert (GESIS - Leibniz Institute for the Social Sciences)

For the EU project Infra4NextGen, GESIS - Leibniz Institute for the Social Sciences provides data and research infrastructure services focused on the five NextGenEU youth policy topic areas: Make it Green; Make it Digital; Make it Healthy; Make it Strong; and Make it Equal. GESIS provides users with a set of cross-national data files containing harmonised and merged data on the five themes from the European Social Survey, Generations and Gender Programme, European Values Study, International Social Survey Programme, European Quality of Life Surveys, and Eurobarometer. Beyond that, we design and set up virtual access to metadata overviews, the harmonised data files, and the R scripts used for the harmonisation. In this presentation, we will introduce https://infra4nextgen.com/harmonisationgateway/, focusing on two sub-webpages: Variable Database and Harmonisation.

The Variable Database is a compilation of measurement items, all of which are key variables in the five pillars of EU youth policy. It includes items that have measured the same concept similarly across countries in at least two survey programmes within the last 20 years. The Harmonisation section consists of pages dedicated to the harmonisation procedures used in the production of the cross-national data files from the selected items in the Variable Database. It demonstrates the step-by-step harmonisation procedures from beginning to end, using R scripts for each data file.

The gateway at https://infra4nextgen.com/harmonisationgateway/ is not only an entry point but also a rich resource and a service to the research community. These webpages both provide access to the harmonised data sets and open up routes for individual-level research. Individuals with an active research programme, or decision-makers at the institutional level, can curate their own data set using the variables displayed in the Variable Database and the R scripts provided on the Harmonisation page.