All time references are in CEST
Automatic coding with deep learning: feature extraction from unstructured survey data
Session Organisers | Dr Arne Bethmann (SHARE Germany and SHARE Berlin Institute), Ms Marina Aoki (SHARE Germany and SHARE Berlin Institute)
Time | Tuesday 18 July, 09:00 - 10:30 |
Room |
Standardised, closed-ended questions are the most common, but by no means the only, form of data collected in surveys. Many surveys collect data such as open-ended text, images, or even audio and video. This so-called unstructured data does not lend itself easily to analysis in, for example, the social or life sciences, but first needs to be coded into simpler variables, a process commonly called feature extraction in the machine learning literature. Examples would be coding occupations from verbatim text responses into a standard classification, scoring drawings from cognitive assessments as correct or incorrect, detecting recorded speech patterns that indicate some form of neurodegenerative disease, or classifying behaviour recorded on video.
Collecting such data is not new, but traditionally it has had to be coded by human coders. This is a labour-intensive, expensive and, depending on the problem, error-prone task. With the development of machine learning approaches, and in particular the recent boom in deep neural networks driven by the availability of massive amounts of computing power and training data, these tasks can increasingly be automated, making the collection of unstructured data more cost-effective - and in some cases financially viable for the first time - and ideally improving data quality at the same time.
This session will look at methods and applications in survey research that use deep learning approaches to extract features from unstructured survey data in order to obtain more easily analysable variables. We aim to discuss both the benefits and limitations of these methods and address key questions such as: Where are these methods useful? Where are traditional machine learning methods sufficient or even more effective? How do you get enough high-quality training data? How do you assess the quality of the predictions?
Keywords: deep learning, machine learning, AI, automatic coding, scoring, rating, unstructured data, feature extraction
Ms Lina-Jeanette Metzger (Institut für Arbeitsmarkt- und Berufsforschung (Institute for Employment Research)) - Presenting Author
Dr Michael Stops (Institut für Arbeitsmarkt- und Berufsforschung (Institute for Employment Research))
Multi-label classification problems on big, unstructured data with a large number of labels can be solved by two methods: ontology-based techniques or supervised machine learning. Whereas completeness is a severe challenge for ontologies, the primary challenge for machine learning is the need for high-quality training data with a sufficient number of observations for each label. We propose a combination of both approaches to leverage their respective advantages and to reduce their drawbacks.
By framing the multi-label problem as a series of independent binary classification tasks, we evaluate and compare an ontology-based approach with a machine learning-based approach on one such binary classification task. Our use case involves identifying job advertisements related to sustainable hydrogen technologies. Our findings demonstrate that combining both approaches yields the best results. Specifically, ontology-based labeled data serves as training data for the machine learning approach, while machine learning-identified job advertisements contribute to the discovery of additional terms for refining the ontology.
Each approach faces additional challenges. The machine learning approach must address a highly imbalanced dataset, as less than 0.1% of job advertisements are identified by the ontology-based approach as related to sustainable hydrogen technologies. On the other hand, the ontology-based approach typically operates with a positive framing: if a term from the ontology is found in the text, the label is assigned. To mitigate the risk of false positives, we developed an algorithm that incorporates a secondary ontology of negative search terms. If a positive search term co-occurs with a negative search term, the label is withheld, improving the precision of the ontology-based approach.
To assess the quality of both approaches, we created a manually labeled dataset containing a significant number of job advertisements related to sustainable hydrogen technologies.
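To illustrate the negative-term rule described above, the sketch below applies a simple document-level co-occurrence check in Python; the term lists and matching logic are hypothetical and do not reflect the authors' actual ontologies.

```python
# Minimal sketch of ontology-based labelling with a secondary negative-term
# ontology. Term lists and the document-level co-occurrence rule are
# hypothetical; the authors' ontologies and matching logic are not shown here.

POSITIVE_TERMS = ["electrolyser", "green hydrogen", "fuel cell"]   # hypothetical
NEGATIVE_TERMS = ["hydrogen peroxide", "hydrogenation"]            # hypothetical

def label_advert(text: str) -> bool:
    """Assign the label only if a positive term occurs and no negative term
    co-occurs in the same advertisement."""
    lowered = text.lower()
    has_positive = any(term in lowered for term in POSITIVE_TERMS)
    has_negative = any(term in lowered for term in NEGATIVE_TERMS)
    return has_positive and not has_negative

# Example: the second advert is suppressed by the negative-term rule.
print(label_advert("Engineer for green hydrogen electrolyser plants"))          # True
print(label_advert("Chemist for hydrogenation of green hydrogen feedstock"))    # False
```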
Dr Arne Bethmann (SHARE Germany, SHARE Berlin Institute) - Presenting Author
Mr Ming Yuan Lee (SHARE Germany, SHARE Berlin Institute)
Mr Knut Wenzig (SOEP, DIW Berlin)
Socio-economic research often requires detailed information on individual occupations in order to study aspects such as occupational prestige, hazards, or gender disparities. Traditionally, this data collection has involved manual coding of respondents’ job descriptions into specific classifications. However, for reasons of efficiency and quality, this process is increasingly being automated using machine learning. While these automated approaches have shown success with large, high-quality datasets in single languages, multilingual occupation coding remains a challenge, especially for low-resource languages with limited training data.
This paper investigates the potential of transfer learning to improve multilingual occupation coding. We explore how models can leverage the predictive power of other languages within large multilingual datasets. Using extensive training data from the German Socio-Economic Panel (SOEP) and the Survey of Health, Ageing and Retirement in Europe (SHARE), we fine-tune several pre-trained language models (DistilBERT) to predict one- and four-digit ISCO-08 codes, representing simple and complex classification tasks, respectively.
We compare the prediction accuracy of models trained exclusively on country-specific data against those trained on the full multilingual dataset. In addition, we examine the impact of boosting a specific language, in this case German using SOEP data, on the accuracy for this and other countries. Our results aim to demonstrate the effectiveness of transfer learning in improving multilingual occupation classification. Specifically, for cross-country studies such as SHARE, we provide insights into whether and how we can mitigate the challenge of coding occupations with high accuracy in low-resource languages.
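As an illustration of the fine-tuning setup described above, the sketch below uses the Hugging Face transformers library to fine-tune a multilingual DistilBERT checkpoint on toy job descriptions; the checkpoint name, label encoding, and data are assumptions, not the authors' configuration.

```python
# Minimal sketch, not the authors' pipeline: fine-tuning a multilingual
# DistilBERT checkpoint to predict ISCO-08 codes from free-text job
# descriptions. Checkpoint name, toy examples, and label encoding are
# illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical toy data: job descriptions and integer-encoded ISCO-08 classes.
texts = ["Krankenpfleger auf einer Intensivstation", "primary school teacher"]
labels = [0, 1]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased",
    num_labels=436,  # ISCO-08 defines 436 four-digit unit groups
)

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="isco08-distilbert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```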
Professor Mengyao Hu (UTHealth Houston) - Presenting Author
Professor Yi Lu Murphey (University of Michigan)
Ms Qin Tian (University of Michigan)
Professor Laura Zahodne (University of Michigan)
Professor Richard Gonzalez (University of Michigan)
Professor Vicki Freedman (University of Michigan)
Alzheimer’s disease and related dementias (ADRD) significantly impact older adults' quality of life and pose challenges to public health systems. The clock-drawing test (CDT) is a widely used dementia screening tool due to its ease of administration and effectiveness. However, manual CDT coding in large-scale studies can be time-intensive and prone to coding errors. In this study, we developed a deep learning neural network (DLNN) system based on a Vision Transformer (ViT) to automate CDT coding using continuous scoring and investigate its value for dementia classification, compared to ordinal CDT scores. Using a nationally representative sample of older adults from the National Health and Aging Trends Study (NHATS), we trained ViT models on CDT images to generate both ordinal and continuous scores. We compared the predictive power of these scores for dementia classification and identified demographic-specific thresholds. Continuous CDT scores provided more precise thresholds for dementia classification than ordinal scores, varying by demographic characteristics. Lower thresholds were identified for Black individuals, those with lower education, and those aged 90 or older. Compared to ordinal scores, continuous scores also allowed for a more balanced sensitivity and specificity. This study offers valuable practical implications for national survey research, demonstrating that DLNN models can effectively automate CDT scoring, minimize coding errors, and produce more nuanced continuous scores that cannot be reliably obtained through manual coding.
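The sketch below illustrates one way a ViT backbone can be repurposed for continuous scoring, assuming a torchvision model with a single-output regression head and a simple squared-error loss; it is an illustrative sketch, not the authors' DLNN system.

```python
# Minimal sketch, assuming a torchvision ViT backbone: replace the
# classification head with a single linear output to regress a continuous
# clock-drawing score. Loss, optimiser, and preprocessing are illustrative
# and do not reproduce the authors' DLNN system.
import torch
import torch.nn as nn
from torchvision.models import ViT_B_16_Weights, vit_b_16

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 1)  # continuous score

criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, scores: torch.Tensor) -> float:
    """One gradient step on a batch of (N, 3, 224, 224) clock drawings and
    their target scores (hypothetically scaled to [0, 1])."""
    optimizer.zero_grad()
    predictions = model(images).squeeze(-1)
    loss = criterion(predictions, scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```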
Ms Marina Aoki (SHARE Germany, SHARE Berlin Institute) - Presenting Author
Dr Arne Bethmann (SHARE Germany, SHARE Berlin Institute)
Cube drawing is a common task in dementia screening protocols and has been part of the cognitive functioning module in the SHARE survey since wave 8. Deep learning models have achieved around 80% accuracy in classifying these drawings as correct, partially correct, or incorrect (see Bethmann, Aoki, Hunsicker & Weileder, 2023). However, model performance is hindered by dataset diversity and subjectivity in annotations, particularly within the ‘partially correct’ class, where noisy labels complicate classification.
Curriculum learning, inspired by human learning processes, offers a promising solution by prioritising simpler samples during early training stages and progressively introducing more complex ones. This approach is particularly effective for small datasets with noisy labels, as it enables models to build a solid foundation before tackling more challenging cases. In this paper, we explore curriculum learning strategies to improve cube drawing classification within the German part of the Wave 8 SHARE data, which contains only about 2,000 cases with relatively noisy interviewer scores.
Specifically, we investigate and compare two methods for determining annotation difficulty: 1) annotator agreement, which reflects consistency between annotators, and 2) training loss from a reference model, as a proxy for sample complexity. We also evaluate two pacing strategies for sample presentation: 1) an extension from binary to multi-class classification by initially excluding the ‘partially correct’ class, and 2) a step-based pacing strategy where all classes are introduced in stages.
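The sketch below illustrates a step-based pacing schedule of the kind described above, where samples ranked by a difficulty score are introduced in progressively larger subsets; the difficulty source and stage fractions are illustrative assumptions.

```python
# Minimal sketch of a step-based pacing schedule: rank samples by a difficulty
# score (e.g. annotator disagreement or a reference model's training loss) and
# train on progressively larger, harder subsets. Stage fractions are
# illustrative assumptions.
from typing import List, Sequence, Tuple

def curriculum_stages(samples: Sequence[Tuple[object, float]],
                      fractions: Sequence[float] = (0.3, 0.6, 1.0)) -> List[list]:
    """Return one training subset per stage, each containing the easiest
    fraction of the ranked samples (lower difficulty = easier)."""
    ranked = [example for example, _ in sorted(samples, key=lambda s: s[1])]
    return [ranked[: max(1, int(len(ranked) * f))] for f in fractions]

# Hypothetical usage: clear 'correct'/'incorrect' drawings get low difficulty,
# noisy 'partially correct' annotations get high difficulty.
stages = curriculum_stages([("drawing_a", 0.1), ("drawing_b", 0.2),
                            ("drawing_c", 0.9)])
print([len(s) for s in stages])  # [1, 1, 3]
```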
Miss Xinyi Chen (University of Michigan-Ann Arbor, Institute for Social Research) - Presenting Author
Mr Mao Li (University of Michigan-Ann Arbor, Institute for Social Research)
Nonsense responses—textual answers that are incoherent or irrelevant to the posed survey question (e.g., “I don’t know,” “Nothing in particular,” “It is what it is”)—can undermine data quality by introducing unique forms of measurement error. While these responses are regarded as a type of satisficing behavior by definition, most research on satisficing has focused on structured survey items, leaving the dynamics of nonsense responses in open-ended questions relatively unexplored. To fill this gap, we extend the Total Survey Error framework to textual data, investigating (1) the prevalence and patterns of nonsense responses, and (2) their relationship to established satisficing behaviors such as straightlining and speeding.
Drawing on a web panel survey (N=1,793) and the Survey of Consumer Attitudes (N=506) across three languages (English, Spanish, and German), we first employ large language models (LLMs) to detect whether open-ended responses adequately address each question. Building on these detections, we summarize the patterns of nonsense responses with LLMs. We then use multi-level regression to examine whether these potentially problematic responses occur systematically, incorporating respondent-level factors (e.g., sociodemographics) and question-level characteristics (e.g., topic, wording complexity). Finally, we assess how nonsense responses intersect with traditional satisficing indicators, including straightlining, overuse of “don’t know,” and speeding, to clarify whether they reflect genuine respondent opinions or minimal cognitive effort.
Our findings reveal the nontrivial frequency of nonsense responses and highlight their distinct distribution and correlates across surveys. By expanding the contexts in which measurement error is conceptualized, this study enriches our understanding of satisficing in textual responses and underscores the importance of robust survey design to enhance data integrity.
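The sketch below illustrates an LLM-based relevance check of the kind described in the methods paragraph above, assuming the OpenAI Python client and a simple YES/NO prompt; the model name, prompt wording, and output format are assumptions, not the authors' prompting strategy.

```python
# Minimal sketch: ask an LLM whether an open-ended response addresses the
# question. Model name, prompt wording, and the binary output format are
# illustrative assumptions, not the authors' actual prompting strategy.
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

def is_nonsense(question: str, response: str) -> bool:
    """Return True if the model judges the response irrelevant or incoherent."""
    prompt = (
        "Survey question: " + question + "\n"
        "Respondent answer: " + response + "\n"
        "Does the answer substantively address the question? Reply YES or NO."
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content.strip().upper().startswith("NO")

# Hypothetical usage:
print(is_nonsense("What is the most important problem facing the country?",
                  "It is what it is"))
```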