Tracking, Tracing, Training: Data Collection and Data Quality in Virtual Environments
Session Organiser |
Dr Sebastian Wenz (GESIS – Leibniz Institute for the Social Sciences)
Time
Friday 16 July, 16:45–18:00
The papers in this session suggest and evaluate different techniques and processes for online or virtual data collection and quality control. Three papers are concerned with digital trace data or metered data: the first presents a new workflow for collecting digital trace data by exploiting EU law on data download packages (DDPs); the new approach avoids technical, methodological, and ethical problems of API-based alternatives. Paper two discusses the problem of device undercoverage in collecting metered data: it examines the amount and types of untracked devices, as well as predictors of undercoverage and its effects on missing data and measurement error. The third paper presents a new total survey error (TSE) framework for metered data. The fourth paper describes challenges, solutions, and lessons learned in developing and conducting a virtual training for clinical interviewers and supervisors to effectively administer the Structured Clinical Interview for DSM-5 (SCID), the NetSCID.
Keywords: Digital Trace Data, Metered Data, Data Collection, Data Quality, Total Survey Error, Interviewer Training
Dr Laura Boeschoten (Utrecht University) - Presenting Author
Dr Jef Ausloos (University of Amsterdam)
Dr Judith Moeller (University of Amsterdam)
Dr Theo Araujo (University of Amsterdam)
Dr Daniel Oberski (Utrecht University)
Digital traces left by citizens during the natural course of modern life hold enormous potential for social-scientific discoveries, because they can measure aspects of our social life that are difficult or impossible to measure by more traditional means.
Digital trace data collected through APIs and web scraping have been used in many applications; however, they do not always fit social-scientific research questions. First, imposed data protections mean these data cannot address questions of individual (user-level) dynamics or networks. Second, APIs provide public data only; much of digital trace data’s putative power, however, lies in private data that is too sensitive to share, such as location history, browsing history, or private messaging. Third, the available data generally pertain to a nonrandom subset of the digital platform’s user group (e.g. Facebook or Twitter), which is not representative of many populations of social-scientific interest. Fourth, for both approaches, the researcher is entirely dependent on the private company that holds the data; sudden retractions of this collaborative spirit can occur, and have occurred, posing a risk to the research process. Finally, even when a data processing company decides to share data for scientific purposes, the citizens who actually generated those data are generally impossible to contact for their consent, in some cases putting a firm legal basis for further data analysis in question.
In this paper, we present an alternative workflow for analyzing digital traces, based on data download packages (DDPs). As of May 2018, any entity, public or private, that processes the personal data of citizens of the European Union is legally obligated by the EU General Data Protection Regulation to provide those data to the data subject upon request, in digital format. Most major private data processing entities, including social media platforms as well as smartphone systems, search engines, photo storage, e-mail, banks, energy providers, and online shops, comply with this right to data access by providing DDPs to data subjects. To our knowledge, most large companies that operate internationally provide the same service to their users outside of the European Union.
Our proposed workflow consists of five steps. First, data subjects are recruited as respondents using standard survey sampling techniques. Next, respondents request their DDPs from various providers, storing these locally on their own devices. Stored DDPs can then be processed locally to extract relevant research variables, after which the respondent’s consent is requested to send these derived variables to the researcher for analysis. To aid researchers in planning, executing, and evaluating studies that leverage the richness of DDPs, we discuss the steps involved in our proposed workflow. For these purposes, traditional survey research has benefited greatly from the “total survey error” framework; we therefore present DDP data collection in a “total error” framework adapted specifically to this new mode of data collection.
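For illustration, the local-extraction step of such a workflow might look like the following minimal Python sketch. The DDP structure (a JSON export containing a "posts" list with timestamps) and the derived variable (posts per ISO week) are hypothetical assumptions for the example, not the authors' implementation; the point is that only the aggregate, not the raw content, leaves the respondent's device.

```python
import json
from collections import Counter
from datetime import datetime

def extract_weekly_post_counts(ddp_path):
    """Parse a hypothetical social media DDP export and return only the
    derived variable (posts per ISO week), so raw content never leaves
    the respondent's device."""
    with open(ddp_path, encoding="utf-8") as f:
        ddp = json.load(f)
    counts = Counter()
    for post in ddp.get("posts", []):
        ts = datetime.fromisoformat(post["timestamp"])
        year, week, _ = ts.isocalendar()
        counts[f"{year}-W{week:02d}"] += 1
    return dict(counts)  # aggregate only; post text and exact times are discarded

if __name__ == "__main__":
    # Minimal stand-in for a real DDP file, written locally for the demo.
    sample = {"posts": [{"timestamp": "2021-03-01T09:15:00", "text": "..."},
                        {"timestamp": "2021-03-02T18:40:00", "text": "..."}]}
    with open("ddp_sample.json", "w", encoding="utf-8") as f:
        json.dump(sample, f)
    print(extract_weekly_post_counts("ddp_sample.json"))  # {'2021-W09': 2}
```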
Mr Oriol J. Bosch (The London School of Economics and Political Science) - Presenting Author
Dr Melanie Revilla (Research and Expertise Centre for Survey Methodology (RECSM) - Universitat Pompeu Fabra)
Metered data, also called “web log data”, “digital trace data”, “online behavioural data” or “web-tracking data”, has been considered the best option for measuring online behaviours, since it can track real behaviour unobtrusively. Metered data is obtained from a meter willingly installed or configured by a sample of participants on their devices (PCs, tablets and/or smartphones). A meter refers to a heterogeneous group of tracking technologies that share with researchers, at a minimum, information about the URLs of the web pages visited by participants.
To gain a comprehensive picture of individuals’ online activity, trackers should be installed on all devices an individual uses and should track all the behaviours individuals perform through their web browsers and apps, on each of the networks they connect to. Several problems can prevent this from happening. For instance, tracking technologies might not be installable on some devices, or participants might decide not to install or configure the meter on some device, browser and/or network. These problems can cause 1) device, 2) browser, 3) in-app and 4) network undercoverage. As a consequence, all or part of the information of interest for a given participant may be missing, which can introduce errors when making inferences about a theoretical concept for a finite population (e.g. the average time spent visiting online news outlets among the adult Internet-using population living in the UK).
Therefore, in this presentation we explore the extent of this problem and its potential consequences. First, we explore the proportion of participants who are undercovered, the number and types of devices not tracked, and the predictors of being undercovered. Second, owing to the nonreactive nature of metered data (i.e. data are not collected by soliciting a response from individuals; Sen et al., 2019), undercoverage can produce missing data or measurement error, depending on how missingness due to undercoverage is dealt with, and the approach used can have implications for substantive conclusions. Hence, we discuss when undercoverage produces missing data and when it produces measurement error, and we explore the substantive implications of different approaches to dealing with missing information caused by undercoverage.
To do so, we combine metered and survey data from several samples of metered individuals in three European countries (Spain, Portugal and Italy). Although results are not yet available, we expect this presentation to help researchers understand 1) the problem of undercoverage and 2) how to deal with it better; a minimal illustration of the missing-data versus measurement-error distinction is sketched below.
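As a concrete, purely hypothetical illustration of that distinction (the numbers and data structure below are invented for the example, not drawn from the study), consider two ways of totalling a participant's metered minutes across devices:

```python
# Hypothetical minutes spent on news sites per device; None = device not metered.
tracked_minutes = {
    "p1": {"pc": 30, "smartphone": 45},    # fully tracked participant
    "p2": {"pc": 25, "smartphone": None},  # undercovered: smartphone untracked
}

def total_as_zero(devices):
    """Treat untracked devices as zero use: every participant yields a value,
    but undercovered totals are biased downward (measurement error)."""
    return sum(minutes or 0 for minutes in devices.values())

def total_as_missing(devices):
    """Treat undercovered participants as missing: the remaining values are
    unbiased, but the analysable sample shrinks (missing data)."""
    if any(minutes is None for minutes in devices.values()):
        return None
    return sum(devices.values())

for pid, devices in tracked_minutes.items():
    print(pid, total_as_zero(devices), total_as_missing(devices))
# p1 75 75
# p2 25 None  -> zero-filling understates p2; treating p2 as missing drops p2
```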
Mr Oriol J. Bosch (The London School of Economics and Political Science) - Presenting Author
Dr Melanie Revilla (Research and Expertise Centre for Survey Methodology (RECSM) - Universitat Pompeu Fabra)
Metered data (also called “web log data”, “digital trace data”, “web-tracking data” or “passive data”) is a type of Big Data obtained from a meter willingly installed or configured by participants on their devices (PCs, tablets and/or smartphones). A meter refers to a heterogeneous group of tracking technologies that share with researchers, at a minimum, information about the URLs of the web pages visited by participants. Depending on the technology used, HTML, time or device information can also be collected. Metered data is objective, free of human memory limitations and produced in real time. It therefore has great potential to replace part of survey data, or to be combined with survey data to obtain a more complete picture of reality. Nevertheless, metered data needs to be used properly: it is crucial to understand its limitations in order to mitigate potential errors. To date, some research has explored potential causes of error in metered data. However, a systematic categorization and conceptualization of these errors is missing.
Therefore, in this presentation, we present a framework of the errors that can occur when using metered data. We adapt the Total Survey Error (TSE) framework (Groves et al., 2009) to accommodate the specific error-generating processes and error causes of metered data. The adapted framework shows how the unique characteristics of metered data can affect data quality and, because it is based on the TSE framework, it also allows metered data errors to be compared with survey errors. Hence, the adapted framework can be useful when using metered data alone or in combination with surveys: to choose the best design options for metered data, but also to make better-informed decisions when planning when and how to supplement or replace survey data with metered data.
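As background for readers unfamiliar with the TSE framework, it is commonly summarized by the mean squared error decomposition of a survey estimate $\hat{\theta}$ of a target parameter $\theta$; this is the standard textbook formulation (Groves et al., 2009), not the authors' adapted framework itself:

$$
\mathrm{MSE}(\hat{\theta}) \;=\; \underbrace{\bigl(\mathrm{E}[\hat{\theta}] - \theta\bigr)^{2}}_{\text{squared bias}} \;+\; \underbrace{\mathrm{E}\bigl[(\hat{\theta} - \mathrm{E}[\hat{\theta}])^{2}\bigr]}_{\text{variance}}
$$

Systematic errors of measurement (e.g., limitations of the tracking technology) and of representation (e.g., undercoverage) enter through the bias term, while variable errors enter through the variance term.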
Dr Heidi Guyer (RTI International) - Presenting Author
Dr Leyla Stambaugh (RTI International)
Dr Paul Geiger (RTI International)
Ms Kathleen Considine (RTI International)
The National Study of Mental Health (NSMH) will provide prevalence rates of serious mental illness and substance use disorders among U.S. adults aged 18 to 65 years. A primary goal of the study is to include both household and non-household populations, for the first time, in order to accurately estimate the mental health of the population. Participants will be selected from prisons, homeless shelters, state psychiatric hospitals, jails and group housing quarters, in addition to an area probability sample of dwelling addresses across the U.S. Data collection will be conducted using a CAPI instrument, including a computerized version of the Structured Clinical Interview for DSM-5 (SCID), the NetSCID.

In order to effectively administer the clinical interview with the target study populations, the project team had to build a workforce of more than 70 Clinical Interviewers and 10 Clinical Supervisors with strong interviewing skills and clinical knowledge. To identify and recruit such a workforce, a virtual hiring protocol was developed to specifically target individuals with foundational education and experience in clinical research. This national hiring process resulted in over 1,300 applicants and the successful selection and onboarding of an interdisciplinary group of highly qualified, clinically trained research interviewers with backgrounds in clinical psychology, social work, psychiatry, and clinical trials.

Interviewer training was originally scheduled for July 2020, with data collection beginning in August 2020. Due to the COVID-19 pandemic, training was delayed by three months and shifted to a fully virtual format. A 40-hour training was planned, followed by up to three certification attempts with actual patients. Numerous adaptations were required to virtually train a large group of new interviewers and supervisors, located across the U.S. in various time zones, on the complex technical systems and the complex clinical interview. Each participant received a pre-configured laptop and a tablet computer at their home prior to training. Pre-recorded training modules were provided with instructions on equipment and systems set-up. These pre-training videos, along with check-in calls prior to the training, ensured that all trainees were prepared for the live virtual training. Clinical training on the administration of the SCID involved a mix of pre-recorded modules provided via e-learning platforms and live, synchronous meetings and discussions with group breakouts and supervision from clinical experts.

After the 40-hour training was completed, all Clinical Supervisors and Clinical Interviewers began the next phase of preparation, during which they practiced via role plays with clinicians and conducted up to three virtual interviews with live patients. This allowed the newly trained staff to test their equipment and systems and receive supervision on their clinical interviewing skills. In this session, we will describe the multiple platforms used to train interviewers, the systems used by trainers to facilitate the learning environment, the feedback we received from training participants, and the rigorous certification program and its outcomes. Lessons learned in each of these areas will be shared as well.