



Tuesday 18th July, 09:00 - 10:30 Room: Q2 AUD1 CGD


Using paradata to assess and improve survey data quality 1

Chair Dr Caroline Vandenplas (KU Leuven)
Coordinator 1 Professor Geert Loosveldt (KU Leuven)
Coordinator 2 Dr Koen Beullens (KU Leuven)

Session Details

Survey methodologists currently face declining response rates, an increasing risk of nonresponse bias and measurement error, and escalating costs of survey data collection. One low-cost approach to tackling these challenges is the use of paradata. Paradata, data about the survey process, have always been available, but their range and level of detail have increased considerably with the computerization of the data collection process. Such data can be used to detect and eventually reduce systematic survey errors and to increase data quality, either during the fieldwork (adaptive designs) or in post-survey adjustment. Paradata can also be used to reduce the cost of the survey process, for example by determining caps on the number of call attempts in telephone surveys.
We are interested in papers that use paradata to detect data quality problems, improve data quality and/or reduce survey costs. For instance, time and timing are both linked to survey costs and data quality, two essential elements of a survey. The timing of visits, calls, and the sending out of questionnaires, requests and reminders has been shown to be a determining factor in survey participation. At the same time, requesting that interviewers work in the evening or at the weekend, or making sure that the reminders for Web or mail surveys are sent on time, may have cost implications. Nonresponse error is not the only type of survey error linked to time: the time taken to answer a question, also called response latency, is known to reflect the cognitive effort of the respondent and, hence, data quality. Interviewer speed can also influence data quality. Moreover, interviewer speed has been shown to depend on the rank of the interview.

The aim of this session is to reflect on possible links between data quality and paradata that capture easily measured characteristics of the different steps of the survey process. Such links could help data collection managers and researchers to detect potential systematic survey errors in a fieldwork monitoring or post-evaluation context, and could lead to opportunities to prevent or correct these errors. We invite papers demonstrating a link between paradata and data quality, as well as papers showing how this link can be used to increase data quality or reduce cost.

Paper Details

1. Fieldwork monitoring and managing with time-related paradata
Dr Caroline Vandenplas (KU Leuven)

In this day and age, time is a critical element of anybody’s life, especially for the active population. The same holds for different facets of the survey process: time and timing are both linked to survey costs and data quality, two essential elements of a survey. During the data collection period, the available time of potential respondents plays a key role in their decision whether or not to participate in the survey, whilst the interviewer’s time, for telephone or face-to-face surveys, is an important factor in his/her capacity to recruit respondents. The timing of visits, calls, and the sending out of questionnaires, requests and reminders has also been shown to be a determining factor in survey participation. At the same time, requesting that interviewers work in the evening or at the weekend, or making sure that the reminders for Web or mail surveys are sent on time, may have cost implications. Nonresponse error is not the only type of survey error linked to time: the time taken to answer a question, also called response latency, is known to reflect the cognitive effort of the respondent and, hence, data quality. Interviewer speed can also influence data quality. Moreover, interviewer speed has been shown to depend on the rank of the interview.
In this presentation, we will give a few examples of how time-related paradata can be used to detect survey error and to improve data quality from a fieldwork monitoring perspective. Using data from the European Social Survey, we will illustrate how the yield of the fieldwork per time unit, which can be derived from the contact forms, and the interview speed, which can be derived from timers, could guide decision making aimed at improving data quality during the fieldwork.
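
As a rough illustration of what "yield of the fieldwork per time unit" could look like in practice, the sketch below counts contact attempts and completed interviews per fieldwork week. It is not the authors' procedure: the record layout, the outcome labels and the function name `weekly_yield` are assumptions made for this example and do not correspond to the actual ESS contact-form coding.

```python
from collections import Counter
from datetime import date

# Hypothetical contact-form records: (date of attempt, outcome label).
# The outcome labels are illustrative, not ESS contact-form codes.
contact_attempts = [
    (date(2016, 9, 5), "noncontact"),
    (date(2016, 9, 6), "interview"),
    (date(2016, 9, 8), "refusal"),
    (date(2016, 9, 13), "interview"),
    (date(2016, 9, 14), "interview"),
    (date(2016, 9, 15), "noncontact"),
    (date(2016, 9, 20), "refusal"),
    (date(2016, 9, 22), "noncontact"),
]


def weekly_yield(attempts, fieldwork_start):
    """Return {week: (attempts, interviews, interviews per attempt)}."""
    n_attempts = Counter()
    n_interviews = Counter()
    for day, outcome in attempts:
        # Week 1 covers the first seven days of fieldwork, and so on.
        week = (day - fieldwork_start).days // 7 + 1
        n_attempts[week] += 1
        if outcome == "interview":
            n_interviews[week] += 1
    return {
        week: (
            n_attempts[week],
            n_interviews[week],
            n_interviews[week] / n_attempts[week],
        )
        for week in sorted(n_attempts)
    }


yield_by_week = weekly_yield(contact_attempts, fieldwork_start=date(2016, 9, 5))
for week, (attempts, interviews, rate) in yield_by_week.items():
    print(f"week {week}: {interviews} interviews from {attempts} attempts ({rate:.2f})")
```

A falling yield per week in such a table is the kind of signal a fieldwork manager might act on, for instance by reallocating interviewer effort; which threshold justifies an intervention is a design decision this sketch does not address.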


2. Monitoring interview duration data: from a Statistical Process Control perspective
Ms Jiayun Jin (KU Leuven)
Professor Geert Loosveldt (KU Leuven)
Dr Caroline Vandenplas (KU Leuven)

Paradata, generally understood as data about the survey process, are widely recommended for collection and monitoring during surveys to help researchers understand what is going on, make interventions, and eventually obtain better-quality statistical estimates (e.g., responsive designs and adaptive designs). However, the prevalent goal has been to improve the representativity of the respondents. Measurement error, which occurs when a response differs from the “true” value, has received less attention in this area.

In this paper, we use Statistical Process Control (SPC) methods and multilevel analysis to describe how paradata can be used to monitor the data collection process with a focus on measurement error. Specifically, we consider interview duration, which has commonly been used as an indicator of measurement error. Using SPC methods, we apply control charts to interview duration to monitor its average and variation continuously during the process, and to detect interviewers and respondents whose interview durations are so long or so short that they are statistically out of control (typically using the rule of 3 standard deviations above or below the average). SPC suggests using these outliers as a proxy indicator of measurement error. Using multilevel analysis with respondents nested within interviewers, we attempt to identify the characteristics of respondents and interviewers that have significant effects on interview duration. Data from the European Social Survey, a biennial face-to-face survey conducted across Europe since 2002, are used to perform the analysis. Comparing the results of SPC and multilevel analysis, we investigate whether both approaches identify the same critical interviewer and respondent characteristics. In addition, we try to integrate information from the multilevel analysis into SPC to identify problematic interviewers and respondents after controlling for the factors that influence interview duration (the significant respondent and interviewer characteristics found in the multilevel analysis).
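
The 3-standard-deviation rule mentioned above can be sketched as follows. This is a simplified individuals-style check under stated assumptions, not the authors' implementation: control limits are estimated from a hypothetical baseline set of interview durations (in minutes) and then applied to new interviews. The data, the interview labels and the function names `control_limits` and `out_of_control` are invented for illustration.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, centre, upper) limits: mean +/- k sample standard deviations."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre, centre + k * spread


def out_of_control(observations, lower, upper):
    """Return the (label, value) pairs that fall outside the control limits."""
    return [
        (label, value)
        for label, value in observations
        if value < lower or value > upper
    ]


# Hypothetical baseline interview durations in minutes (assumed in control).
baseline_minutes = [58, 62, 65, 60, 57, 63, 61, 59, 64, 61]
lower, centre, upper = control_limits(baseline_minutes)

# Hypothetical new interviews to monitor against those limits.
new_interviews = [("int_01", 60), ("int_02", 31), ("int_03", 66), ("int_04", 95)]
flagged = out_of_control(new_interviews, lower, upper)

print(f"limits: {lower:.1f} to {upper:.1f} minutes (centre {centre:.1f})")
for label, minutes in flagged:
    print(f"{label}: {minutes} minutes is outside the control limits")
```

Estimating the limits from a separate baseline matters here: with limits computed from the same small batch that contains the outlier, the outlier inflates the standard deviation and may never exceed 3 standard deviations.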

The results of this paper can lead to improved use of paradata during the data collection process to detect possible measurement errors, and have the potential to indirectly reduce measurement errors as well.


3. The use of auxiliary and event data in tracking an inhomogeneity of substantive results from surveys in cross-national studies. Example of ESS.
Ms Teresa Zmijewska-Jedrzejczyk (Institute of Philosophy and Sociology, Polish Academy of Sciences)

When analyzing data from cross-national surveys, one must be aware of country differences, because the homogeneity of the data needs to be controlled. Fulfilling this requirement remains challenging for researchers. The aim of the paper is to present a link between paradata and data quality in order to trace the homogeneity of the substantive results from surveys and to detect possible context effects.

Analyzing the survey results by the fieldwork date of the interview gives an opportunity to identify variables prone to inhomogeneity. Among the long list of survey effects related to the fieldwork date and the length of the survey period, the context effects of survey climate (Loosveldt & Joye, 2016) and of sudden transitory issues or long-lasting events (Michalopoulou, 2015) are relevant.

Results of analyzing data from the ESS show that the fieldwork date of the interview is an important factor in the assessment of data homogeneity. However, the nature of such effects follows country-specific patterns.
Further exploration of the issues, events and claims salient during the fieldwork period helps to formulate hypotheses about which transitory factors might affect responses to questions concerning opinions or behaviors. This insight provides evidence that the variables prone to inhomogeneity concern various topics, not necessarily directly connected to the issues salient during the fieldwork period.

Using paradata together with event data helps to evaluate data quality better. In the future, the results of such evaluations give an opportunity to improve fieldwork procedures in order to maximize standardization, which is one of the key premises of survey research.


4. Using timestamps to monitor fieldwork and evaluate data quality: Experiences from a household survey in Tanzania
Dr Johanna Choumert Nkolo (Economic Development Initiatives (E.D.I.), High Wycombe, United Kingdom / Bukoba, Tanzania)
Mr Henry Cust (Economic Development Initiatives (E.D.I.), High Wycombe, United Kingdom / Bukoba, Tanzania)
Mr Callum Taylor (Economic Development Initiatives (E.D.I.), High Wycombe, United Kingdom / Bukoba, Tanzania. Email: c.taylor@surveybe.com)

BACKGROUND
Monitoring data quality is an essential part of any serious, large-scale data collection process. It helps to identify errors and inconsistencies in the data and results in a cleaner, more accurate dataset. The increasing use of Computer Assisted Personal Interview (CAPI) software has expanded the possibilities available to survey managers and researchers. CAPI makes it possible to capture a wide range of paradata, which can be used to complement the questionnaire data, metadata and auxiliary data.
Timestamps record the exact time when a selected question is answered. They provide useful information throughout the interview that can be used in various ways.

OBJECTIVES
The purpose of this contribution is threefold. First, we aim to explain how timestamps should be implemented in CAPI software, using the example of the surveybe software. Second, we show how they can be used before, during and after data collection. Third, we show that using multiple timestamps is better than using ‘start-time’ and ‘end-time’ timestamps only.

METHODS
In this paper, we use timestamps collected for an 800-household survey conducted in Tanzania in November and December 2016. For this 1-hour questionnaire, we collected over 20 timestamps. The majority of these timestamps were completely invisible to the enumerator, and the information was used during the three phases of fieldwork: (i) preparation, (ii) real-time fieldwork monitoring, and (iii) evaluation of data quality.
During the preparation phase, we used timestamps in the training of interviewers, when they performed mock interviews and piloting. This gave us a thorough breakdown of the interview length by section. Sections taking too much time or raising difficulties for respondents can be clearly identified before interviewers reach the field.
During fieldwork, we used timestamps to monitor day-to-day activities and planning of field teams. They were also used to monitor interviewers’ performance during each section of the interview and address issues immediately.
During the data cleaning phase, we used timestamps to highlight the types of questions that respondents took more time to answer. For instance, we found that, overall, respondents took more time to answer perception questions than more objective questions.
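
To illustrate why several timestamps give more information than ‘start-time’ and ‘end-time’ alone, the sketch below turns a sequence of timestamps from one interview into per-section durations. It is a minimal example under stated assumptions and does not reflect the surveybe export format: the marker names, the times and the function name `section_durations` are hypothetical.

```python
from datetime import datetime

# Hypothetical timestamp export for one interview: (marker name, time recorded).
# Marker names and times are invented for illustration.
timestamps = [
    ("start", "2016-11-21 09:02:10"),
    ("household_roster_done", "2016-11-21 09:14:40"),
    ("consumption_done", "2016-11-21 09:41:05"),
    ("perceptions_done", "2016-11-21 09:58:35"),
    ("end", "2016-11-21 10:03:10"),
]


def section_durations(stamps):
    """Return minutes elapsed between consecutive timestamps.

    Each interval is keyed by the marker that closes it.
    """
    parsed = [
        (name, datetime.strptime(text, "%Y-%m-%d %H:%M:%S"))
        for name, text in stamps
    ]
    durations = {}
    for (_, begin), (name, finish) in zip(parsed, parsed[1:]):
        durations[name] = (finish - begin).total_seconds() / 60
    return durations


durations = section_durations(timestamps)
total_minutes = sum(durations.values())

for name, minutes in durations.items():
    print(f"{name}: {minutes:.1f} min ({minutes / total_minutes:.0%} of interview)")
```

With only the first and last timestamp, the same interview would yield a single total duration, and a slow or rushed section would be invisible.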

RESULTS
We show that sections containing different types of questions can have varying lengths per question. Overall, our results corroborate the importance of collecting and analyzing paradata to monitor fieldwork and ensure data quality. Moreover, they show that inserting several timestamps instead of the usual ‘start-time’ and ‘end-time’ ones adds value to the data monitoring process.


5. Paradata as an aide to questionnaire design: Improving quality and reducing burden
Mrs Emma Timm (Office for National Statistics, UK)
Mr Jordan Stewart (Office for National Statistics)
Mr Ian Sidney (Office for National Statistics)

The UK Office for National Statistics (ONS) is moving its business and social surveys, and the Census, to electronic modes of data collection. This focus on electronic data collection (EDC) has presented opportunities to access additional data, namely paradata.

Paradata are data about the survey data collection process that are captured automatically during EDC. They include call records, interviewer observations, time stamps, and other data captured during the process.

Analysis of paradata supports continuous improvement of questionnaires and the wider services that support them. Analysis of paradata and follow-up of respondents to whom the paradata relate could reduce the time and cost burden on those responding.

During an online pilot of one of ONS’ business surveys, the analysis of paradata assisted in the identification of issues with the questionnaire and associated processes for completion. In one case of interest, a respondent showed a pattern of moving through the survey; reaching the last question; exiting; and then, at a later date, going back through all questions, before submitting.

Matching the respondent to call centre records, it was found that they had called to request a specimen questionnaire. This was to enable the respondent to collate the data required to complete the questionnaire. The only way of previewing the questions, otherwise, was to enter false data to bypass validation. This insight demonstrated how ONS could reduce response burden by providing an option to preview the questionnaire up-front. Solutions will be designed and tested before implementation.

This presentation will discuss case study examples that demonstrate the value of paradata analysis in improving data quality and reducing response burden. Plans to recruit respondents to take part in follow-up research will also be discussed, along with associated ethical considerations concerning data linkage. It is planned that semi-structured interviews will be used to examine the reasons behind interesting paradata, as well as respondents’ experiences of registering for and completing the survey online.