Combining Data Science and Survey Research to Improve (Training) Data Quality

Coordinator 1 | Dr Christoph Kern (LMU Munich)
Coordinator 2 | Dr Ruben Bach (University of Mannheim)
Coordinator 3 | Mr Jacob Beck (LMU Munich)
Coordinator 4 | Dr Stephanie Eckman (RTI International)
The data science world has made considerable progress in the development of machine learning techniques and algorithms that can make use of various forms of data. These methods have been picked up by survey researchers in various contexts to facilitate the processing of survey, text and sensor data or to combine data from different sources. Ultimately, the hope in these applications is to make use of various data sources efficiently while maintaining or improving data quality.
The data science field has paid less attention to the creation and use of high-quality training data in the development of algorithms. AI researchers, however, are increasingly realizing that insights from their models are constrained by the quality of the data the models are trained on. Selective participation or measurement error in data generation and collection can considerably affect the trained models and their downstream application. Misrepresentation of social groups during model training, for example, can result in disparate error rates across subpopulations.
This session brings together methodological approaches from both data science and survey science that aim to improve data quality. We will discuss how both disciplines can learn from each other to improve representation and measurement in various data sources. We welcome submissions that demonstrate how data science techniques can be used to improve quality in data collection contexts, e.g. for
- utilizing information from digital trace data
- improving inference from non-probability samples
- integrating data from heterogeneous sources
The session will also feature studies that utilize the survey research toolkit for improving training data quality, e.g. to
- assess and improve the representativity of training data
- collect training data with better labels
- improve the measurement of the input used for prediction models.