Data management and processing

Session Organiser: Dr Katharina Meitinger (Utrecht University)
Time: Friday 23 July, 15:00 - 16:30
This session contains four presentations about data management and processing. Sebastian Netscher will talk about Domain Data Protocols for Educational Research in Germany. Modesto Escobar will talk about Interactive Network Graphs Online to Analyze Surveys. Christof Wolf will talk about KonsortSWD: A New Joint German Research Data Infrastructure for the Social Sciences. And finally, Anja Perry will talk about Measuring the Costs of Research Data Management.
Dr Sebastian Netscher (GESIS - Leibniz-Institute for the Social Sciences) - Presenting Author
Mrs Anna Schwickerath (GESIS - Leibniz-Institute for the Social Sciences)
Researchers are increasingly encouraged by different stakeholders to make research processes as transparent as possible, to enable reproducible research results and to share their (research) data FAIRly and openly with others. Such requirements can be challenging, as not all researchers are familiar with the concepts of FAIR and open data. At the same time, existing tools (and guidance) to foster the creation of FAIR data – such as templates for data management plans – vary to a great extent, rarely indicating what the best practice solution is.
The project Domain Data Protocols for Educational Research in Germany aims to address this issue by developing a standardized tool for creating data management plans. Funded by the German Federal Ministry of Education and Research, it brings together twelve German research institutions with diverse areas of expertise in educational research to develop so-called Domain Data Protocols (DDPs for short).
Based on a concept by Science Europe, DDPs are open, standardized, and referenceable data protocols, serving as a ‘model’ data management plan for a specific research domain – in our case, educational research. DDPs benefit various stakeholders. First, they assist researchers in carrying out excellent data management and in preparing project proposals and funding applications, as well as offering support for data archiving and sharing. Second, they enable the research community to replicate results and others to re-use the data in new (research) contexts. Third, DDPs simplify reviews of data management, reducing the effort of examining funding applications and (periodic) reports on data management by implementing standardized procedures. Finally, DDPs ease data ingest into a data repository or archive by supporting researchers in creating FAIR data.
The development of DDPs is not without challenges, as their structure needs to be flexible enough to cover different types of data and methods and to enable researchers to capture their individual, project-specific requirements. DDPs therefore consist of different modules, e.g. on data collection, documentation, legal issues, and data sharing. Each module contains elements defining a minimum set of requirements for what FAIR data look like and includes use cases, standards, relevant regulations, as well as further resources on related data management practices.
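As a purely hypothetical illustration (the abstract does not specify the concrete DDP schema, so the module and element names below are invented), a modular protocol of this kind could be represented as structured data and checked against its minimum requirements:

```python
# Hypothetical sketch of a Domain Data Protocol (DDP) as structured data.
# Module names, element names, and values are illustrative only, not the
# project's actual schema.
ddp = {
    "data_collection": {
        "required": ["instrument_description", "sampling_design"],
        "provided": {"instrument_description": "CAPI questionnaire v2",
                     "sampling_design": "two-stage probability sample"},
    },
    "documentation": {
        "required": ["codebook", "variable_labels"],
        "provided": {"codebook": "codebook.pdf"},  # variable_labels missing
    },
    "legal_issues": {
        "required": ["consent_form"],
        "provided": {"consent_form": "consent_v1.pdf"},
    },
}

def missing_elements(protocol):
    """Return, per module, the required elements that have no entry yet."""
    return {module: [e for e in spec["required"] if e not in spec["provided"]]
            for module, spec in protocol.items()}

print(missing_elements(ddp))
# {'data_collection': [], 'documentation': ['variable_labels'], 'legal_issues': []}
```

A check of this kind mirrors the reviewing and ingest uses described above: a standardized module structure makes completeness machine-checkable.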
Dr Anja Perry (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Dr Sebastian Netscher (GESIS - Leibniz Institute for the Social Sciences)
Barend Mons states in his 2020 Nature article that around 5% of the overall research budget should go towards research data management (RDM) activities. However, such lump sums ignore differences across research projects and may call higher requested RDM budgets into question whenever projects are more complex. Research on RDM costs is still in its infancy, and precise statements about which RDM activities are eligible for funding are rare. Clearer and more precise recommendations on how to budget RDM activities in research projects will help to raise awareness of the importance of RDM, to devote appropriate time and effort to RDM tasks, and to help researchers budget for these tasks when applying for research funding. They will consequently improve data sharing and reuse.
However, budgeting RDM activities is difficult and complex and depends on the type of data, their volume, and the information they include. In this paper, we aim to investigate the costs of RDM activities, more specifically of data cleaning and documentation. We make use of a pilot study conducted at the GESIS Data Archive for the Social Sciences in Germany between December 2016 and September 2017. During this period, data curators at the GESIS Data Archive documented their working hours while cleaning and documenting data from ten quantitative survey studies. We analyze this documentation and interview data curators to identify and examine important cost drivers in RDM, i.e., aspects that increase the hours spent on these tasks, and factors that reduce their workload.
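The kind of analysis described here can be sketched in a few lines; the task names and hour figures below are invented for illustration and are not the GESIS pilot-study data:

```python
# Illustrative only: aggregating curators' logged hours by RDM task to
# surface cost drivers. Studies, tasks, and numbers are invented examples.
logs = [
    ("study_01", "data_cleaning", 12.5),
    ("study_01", "documentation", 8.0),
    ("study_02", "data_cleaning", 30.0),  # e.g. many open-ended variables
    ("study_02", "documentation", 6.5),
]

hours_by_task = {}
for _, task, hours in logs:
    hours_by_task[task] = hours_by_task.get(task, 0.0) + hours

print(hours_by_task)  # {'data_cleaning': 42.5, 'documentation': 14.5}
```

Comparing such aggregates across studies with different characteristics (volume, sensitive content, open-ended items) is one way to isolate cost drivers.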
Mr Christof Wolf (GESIS Leibniz-Institute for the Social Sciences) - Presenting Author
The presentation introduces a new German initiative to improve the research data infrastructure. In 2020, the Federal Republic of Germany and its Länder started funding a national research data infrastructure (NFDI) across all scientific disciplines. It currently gathers nine consortia representing fields as diverse as engineering, population health, catalysis-related sciences, and cultural studies. Within the NFDI, KonsortSWD represents the social, educational, behavioral and economic sciences. The consortium is based on a network of, to date, 38 research data centers (RDCs), which together hold over 4,300 datasets and serve over 50,000 data users every year, many of whom are not based in Germany.
The consortium’s main objectives for the first funding period of five years are to:
- Strengthen the FAIRness of data and metadata, in particular by focussing more strongly on introducing aspects of RDM into the planning and collection phases of empirical projects. This objective will be illustrated by introducing one work package tasked with creating concepts to supply DOIs not only at the level of datasets but also at sub-levels such as concepts or variables. Additionally, we strive to improve the interoperability of data by providing scripts for ex-post harmonization of widely used demographic, social and cultural variables in prominent datasets.
- Widen the scope of available data by enlarging the network of RDCs and thereby making new data available for even wider communities. This objective will be illustrated by introducing our efforts to bring data from qualitative social research, i.e. text, audio, or video data, into the data landscape coordinated by KonsortSWD. A related objective is improving methods to link relevant context data to social-science text corpora such as social media or party manifestos.
- Deepen the cooperation between RDCs to overcome obstacles of the fragmented data landscape by generating common standards for data access, joint RDM training, and support for RDCs in their quality assurance and management efforts. This objective will be illustrated by highlighting a work package aiming to make sensitive data available at reduced cost by offering more access points to these data.
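The ex-post harmonization scripts mentioned under the first objective can be illustrated with a minimal recoding step; the surveys, source categories, and target scheme below are invented examples, not KonsortSWD's actual harmonization scripts:

```python
# Hedged sketch: ex-post harmonization of an education variable from two
# hypothetical surveys into a common three-category scheme. All mappings
# are invented for illustration.
HARMONIZATION = {
    "survey_a": {1: "low", 2: "low", 3: "medium", 4: "high"},
    "survey_b": {"primary": "low", "secondary": "medium", "tertiary": "high"},
}

def harmonize(source, values):
    """Map source-specific codes onto the common scheme."""
    mapping = HARMONIZATION[source]
    return [mapping.get(v, "missing") for v in values]

print(harmonize("survey_a", [1, 3, 4]))     # ['low', 'medium', 'high']
print(harmonize("survey_b", ["tertiary"]))  # ['high']
```

Once variables from different datasets share one coding scheme, they become interoperable in the FAIR sense and can be pooled or compared directly.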
This national initiative will closely collaborate with European research infrastructures for the social sciences such as CESSDA, ESS and SHARE as well as with general research infrastructures such as the European Open Science Cloud (EOSC).
Mr Modesto Escobar (Universidad de Salamanca) - Presenting Author
Mr Pablo Escobar (University of Essex)
This demo presents an interactive, dynamic tool for data visualization and analysis. The workshop will cover the basics of coincidence analysis and the use of netCoin, a package to create analytic web graphs. Graphs of this kind have been employed not only to solve topographic problems and to represent network structures, such as those in social media, but also to show the correlation between variables according to causal models. Path analysis and structural equation models are indeed well known to social scientists, but both were restricted to quantitative variables in their early stages. In this tutorial, we propose a new way to display social media data, and connections between qualitative variables, in a manner similar to correspondence analysis but using another set of multivariate techniques, such as linear and logistic regression, combined with network analysis. For example, one specific use of this technique involves characterizing different response profiles by diverse sociodemographic variables.
Network coincidence analysis (NCA) explores links or co-occurrences of people, characteristics, or events under certain circumstances. The R package netCoin implements this coincidence analysis and generates attractive interactive data visualizations. The package allows you to analyze relationships in survey data, connections in social media data, or any other type of network data by drawing interactive plots. netCoin can also be used to visualize statistical models such as linear regressions or structural equation models. The workshop will include examples using both social media and survey data.
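netCoin itself is an R package; the fragment below only sketches the underlying idea of coincidence analysis in Python (counting how often attributes co-occur in the same respondent and turning frequent pairs into network edges) and does not reproduce netCoin's API. The respondent data are invented:

```python
from itertools import combinations
from collections import Counter

# Toy data: each set holds the attributes one respondent exhibits.
respondents = [
    {"female", "urban", "uses_twitter"},
    {"female", "rural"},
    {"male", "urban", "uses_twitter"},
    {"female", "urban", "uses_twitter"},
]

# Coincidence counts: how often two attributes occur in the same respondent.
pairs = Counter()
for attrs in respondents:
    for a, b in combinations(sorted(attrs), 2):
        pairs[(a, b)] += 1

# Edges of the coincidence network: pairs co-occurring at least twice.
edges = [(a, b, n) for (a, b), n in pairs.items() if n >= 2]
print(sorted(edges))
# [('female', 'urban', 2), ('female', 'uses_twitter', 2), ('urban', 'uses_twitter', 3)]
```

In netCoin the analogous counts are then weighted by statistical criteria and rendered as an interactive web graph rather than printed as tuples.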
Ms Geneviève Michaud (Sciences Po) - Presenting Author
Mr Quentin Agren (Sciences Po)
Ms Agnalys Michaud (Sciences Po)
When it comes to setting up a service to manage cross-national samples for longitudinal internet surveys, researchers and research projects need a responsive software solution that accommodates limited resources, be it financial resources, time, or operational skill sets. The solution should be user-friendly, providing straightforward online training material and user guides. It should minimize the complexity of managing a cross-national sample and allow coordinating and harmonizing processes across participating countries, tackling multilingualism, synchronization, and collaboration issues. The solution should comply with ethical and legal rules regarding personal data protection. It should also come with pre-defined roles and corresponding accreditations to access specific data subsets on a need-to-know basis, while facilitating central management. Furthermore, for security reasons, data flows should be minimized.
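The need-to-know requirement above can be pictured as role-based field filtering; the roles, field names, and record below are invented for illustration and are not the actual WPSS design:

```python
# Hypothetical sketch of need-to-know access to panel contact data:
# each role sees only the fields its accreditation covers. Roles and
# field names are invented examples.
ROLE_FIELDS = {
    "national_coordinator": {"panelist_id", "email", "phone", "country"},
    "central_team": {"panelist_id", "country"},  # no direct identifiers
}

def visible_subset(role, record):
    """Return only the fields the given role is accredited to see."""
    allowed = ROLE_FIELDS[role]
    return {k: v for k, v in record.items() if k in allowed}

record = {"panelist_id": "P-001", "email": "a@example.org",
          "phone": "+33 0000", "country": "FR"}
print(visible_subset("central_team", record))
# {'panelist_id': 'P-001', 'country': 'FR'}
```

Filtering at the service boundary in this way also minimizes data flows: central staff never receive direct identifiers they do not need.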
In the framework of the work package "innovations in data production", part of the European Horizon 2020 programme Social Sciences and Humanities Open Cloud (SSHOC, grant 823782) as well as of ESS-SUSTAIN-2 (grant 871063), the European Social Survey (ESS ERIC) and Sciences Po have been collaborating since 2019 to provide a tool to manage cross-national samples for longitudinal internet surveys. After concluding that no off-the-shelf software solution meets our needs, we have been designing and developing a software application to take up these challenges.
During the session, we will run through a lively demonstration of the first version of our fully functional web panel sample service (WPSS) paired through an Application Programming Interface (API) with the Qualtrics software platform. Following a realistic scenario, we will play each stakeholder's role and give a comprehensive tour of the software features: from distributed import and export of sub-sample contact data, to centralized invites and reminders using email and short text messages or the panelists' portal ...