ESRA 2025 sessions by theme

Automatic coding with deep learning: feature extraction from unstructured survey data

Coordinator 1: Dr Arne Bethmann (SHARE Germany and SHARE Berlin Institute)
Coordinator 2: Ms Marina Aoki (SHARE Germany and SHARE Berlin Institute)

Session Details

Standardised, closed-ended questions are the most common, but by no means the only, form of data collected in surveys. Many surveys also collect open-ended text, images, or even audio and video. Such unstructured data does not lend itself easily to analysis in, for example, the social or life sciences; it first needs to be coded into simpler variables, a process commonly called feature extraction in the machine learning literature. Examples include coding occupations from verbatim text responses into a standard classification, scoring drawings from cognitive assessments as correct or incorrect, detecting speech patterns in recordings that indicate a neurodegenerative disease, or classifying behaviour recorded on video.
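The occupation-coding example above can be sketched as a minimal baseline: assign each verbatim answer the code of its most similar human-coded training answer. The four-digit codes below follow the ISCO-08 scheme, but the training answers themselves are toy data invented for illustration; real coding pipelines use far larger training sets and, increasingly, deep learning models rather than this bag-of-words nearest-neighbour rule.

```python
from collections import Counter
import math

# Toy training data: verbatim occupation answers paired with ISCO-08 codes
# as a human coder might assign them (illustrative examples only).
TRAIN = [
    ("primary school teacher", "2341"),
    ("teaches maths at a secondary school", "2330"),
    ("registered nurse on a hospital ward", "2221"),
    ("nurse in an old people's home", "2221"),
    ("software developer", "2512"),
    ("writes computer programs", "2512"),
]

def bow(text):
    """Bag-of-words representation: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def code_occupation(answer):
    """Assign the code of the most similar training answer (1-nearest-neighbour)."""
    return max(TRAIN, key=lambda ex: cosine(bow(answer), bow(ex[0])))[1]

print(code_occupation("I am a nurse at the local hospital"))  # -> 2221
print(code_occupation("develops software for a bank"))        # -> 2512
```

A deep learning coder would replace the bag-of-words vectors with learned text embeddings, but the overall shape of the task, mapping free text onto a fixed classification, is the same.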

Collecting such data is not new, but traditionally it has had to be coded manually by human coders, a labour-intensive, expensive and, depending on the problem, error-prone task. With the development of machine learning approaches, and in particular the recent boom in deep neural networks driven by the availability of massive amounts of computing power and training data, these tasks can increasingly be automated. This makes the collection of unstructured data more cost-effective (and in some cases financially viable for the first time), while ideally improving data quality at the same time.

This session will look at methods and applications in survey research that use deep learning approaches to extract features from unstructured survey data, yielding variables that are easier to analyse. We aim to discuss both the benefits and limitations of these methods and to address key questions such as: Where are these methods useful? Where are traditional machine learning methods sufficient, or even more effective? How do you obtain enough high-quality training data? How do you assess the quality of the predictions?
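On the last question, one common approach is to compare automatic codes against a human-coded gold standard using a chance-corrected agreement measure such as Cohen's kappa, the same statistic used to assess agreement between human coders. A minimal sketch, using invented labels for illustration:

```python
from collections import Counter

def cohens_kappa(gold, pred):
    """Chance-corrected agreement between gold-standard and predicted codes."""
    assert len(gold) == len(pred) and gold
    n = len(gold)
    # Observed agreement: share of cases where the codes match.
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected agreement under chance, from the marginal code frequencies.
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    expected = sum(gold_freq[c] * pred_freq.get(c, 0) for c in gold_freq) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical evaluation set: human gold codes vs. automatic predictions.
gold = ["2221", "2221", "2512", "2341", "2330", "2512"]
pred = ["2221", "2512", "2512", "2341", "2330", "2512"]
print(round(cohens_kappa(gold, pred), 3))  # -> 0.769
```

Simple accuracy would overstate quality here because some codes are far more frequent than others; kappa discounts the agreement that matching marginal distributions would produce by chance.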