Occupation coding 2
Chair | Professor Matthias Schonlau (University of Waterloo)
Coordinator 1 | Mr Malte Schierholz (IAB)
At Statistics Portugal, Economic Activity and Occupation are two of the most commonly collected variables in social surveys. They are classified according to the Classificação Portuguesa das Atividades Económicas Rev.3 (CAE) and the Classificação Portuguesa das Profissões 2010 (CPP).
Both classifications have a large number of categories, which in itself poses challenges for the coding process. Each year, approximately 100,000 responses need coding for both variables.
Nonetheless, the biggest challenge remains the data to be classified. In CAPI and CATI interviews, these questions are collected in as much detail as possible, based on descriptions of the occupation and its main tasks, and of the main activity (what is done at the workplace). This means that data are collected with an open-ended question, without any pre-coded aid or input restriction, which results in a high diversity of textual descriptions. The same word can be written in a multitude of variations due to spelling errors, (mis)use of abbreviations, capitalization, accents, or hyphenation, to name just a few. This is quite understandable, since interviewers enter these data “on-the-fly”.
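To illustrate the kinds of variation listed above (capitalization, accents, hyphenation), here is a minimal Python sketch of a text normalization step. It is an assumption made for illustration only: the abstract does not say whether or how answers are normalized before coding, and the function name `normalize_answer` is hypothetical.

```python
import re
import unicodedata


def normalize_answer(text: str) -> str:
    """Reduce superficial variation in a free-text answer (illustrative only)."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, so "é" becomes "e" and "ç" becomes "c".
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Ignore capitalization.
    lowered = no_accents.lower()
    # Replace hyphens, punctuation and repeated whitespace with a single space.
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


print(normalize_answer("  Técnico-Administrativo "))
print(normalize_answer("TECNICO ADMINISTRATIVO"))
```

Both calls produce the same string, so two spellings of one description would be treated as identical. Spelling errors are not handled by a step like this and need a distance-based comparison instead.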
Until now, coding has been done exclusively by a team of coding experts, some with more than 10 years of experience. Since manual coding is nowadays considered both time-consuming and error-prone, Statistics Portugal started exploring automatic coding in order to define the best solution for implementation in social surveys.
This paper / presentation will address three topics: (1) creating and expanding dictionaries, (2) making automatic coding accessible, and (3) monitoring performance and providing useful data to validate results and improve performance.
Taking advantage of a database of more than 500,000 manually coded responses, collected over a 5-year period, it was possible to compute distance metrics between strings from the dictionaries and strings written by interviewers to describe both Economic Activity and Occupation. For this purpose, the stringdist R package by M.P.J. van der Loo (2014) was used. This package provides, among other algorithms, the optimal string alignment distance (an extension of the Levenshtein distance that allows for transpositions of adjacent characters). The algorithm performed very well, and it was possible to expand the original dictionaries by 30% to 40% with strings written by interviewers.
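The optimal string alignment distance mentioned above can be sketched in a few lines. The following Python version is a plain re-implementation for illustration, not the stringdist code used by the authors. The helper `closest_entry`, the `max_distance` threshold, and the codes "X1" and "X2" are hypothetical placeholders.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def closest_entry(answer, dictionary, max_distance=2):
    """Return (entry, code, distance) for the nearest dictionary entry, or None."""
    best = None
    for entry, code in dictionary.items():
        dist = osa_distance(answer, entry)
        if dist <= max_distance and (best is None or dist < best[2]):
            best = (entry, code, dist)
    return best


# Hypothetical dictionary: the codes are placeholders, not real CPP codes.
dictionary = {"professor": "X1", "enfermeiro": "X2"}
print(closest_entry("profesor", dictionary))
```

An interviewer string that lands within the threshold of a dictionary entry is a candidate for addition to the dictionary under that entry's code. The threshold of 2 is an arbitrary choice for this sketch.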
In a subsequent step, dictionaries were expanded with data from those previously coded answers.
In order to make automatic coding accessible, an R package, INEautoclass, was created to classify both Economic Activity and Occupation at a 2- or 3-digit level of detail. The dictionaries themselves are part of the package, as is useful documentation. The package is able to code 57% of all answers from the Labour Force Survey.
Before entering production mode, it is vital to monitor performance and provide useful data to validate results and improve performance. Hence an RMarkdown report continuously provides information on coding percentage, error rates, and details of the cases where automatic coding does not match human coding.
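As a rough illustration of the monitoring quantities named above, the following Python sketch computes a coding percentage, an error rate, and a tally of mismatches from paired automatic and human codes. The function `coding_report`, the record layout, and the definition of the error rate (mismatches divided by automatically coded answers) are assumptions made for this example. The actual report is written in RMarkdown, and its definitions are not given in the abstract.

```python
from collections import Counter


def coding_report(records):
    """Summarise automatic coding against human coding.

    records: iterable of (auto_code, human_code) pairs, where auto_code is
    None when the automatic coder left the answer uncoded.
    """
    records = list(records)
    total = len(records)
    coded = [(auto, human) for auto, human in records if auto is not None]
    mismatches = [(auto, human) for auto, human in coded if auto != human]
    return {
        # Share of all answers that received an automatic code.
        "coding_rate": len(coded) / total if total else 0.0,
        # Share of automatically coded answers that disagree with the human code.
        "error_rate": len(mismatches) / len(coded) if coded else 0.0,
        # How often each (auto, human) disagreement occurs.
        "mismatches": Counter(mismatches),
    }


# Made-up example data, for illustration only.
report = coding_report([("11", "11"), ("12", "13"), (None, "14"), ("12", "13")])
print(report["coding_rate"], report["error_rate"])
print(report["mismatches"].most_common(1))
```

The mismatch tally shows which code pairs are confused most often, which is the kind of detail that helps decide which dictionary entries to review.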
Occupation coding refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually at great expense. We propose three methods for automatic coding: combining separate models for the detailed/aggregate occupation codes, a hybrid method combining a duplicate-based approach with statistical learning, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist.
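The abstract does not describe how the hybrid method works in detail. One simplified reading of "a duplicate-based approach combined with statistical learning" is sketched below in Python: if an answer exactly duplicates previously coded answers, take the most frequent code among those duplicates, otherwise fall back to a learned model. The function names, the `min_count` parameter, and the `fallback` callable are hypothetical, and the sketch should not be taken as the authors' method.

```python
from collections import Counter, defaultdict


def build_duplicate_index(training):
    """Map each previously seen answer text to a tally of its assigned codes."""
    index = defaultdict(Counter)
    for text, code in training:
        index[text.strip().lower()][code] += 1
    return index


def hybrid_code(answer, index, fallback, min_count=1):
    """Use the majority code of exact duplicates, else a fallback model."""
    counts = index.get(answer.strip().lower())
    if counts:
        code, n = counts.most_common(1)[0]
        if n >= min_count:
            return code, "duplicate"
    # No sufficiently frequent duplicate: defer to the statistical learner.
    return fallback(answer), "model"


# Made-up training data and a stand-in model that always predicts "Z".
index = build_duplicate_index(
    [("Nurse", "A"), ("nurse", "A"), ("nurse", "B"), ("teacher", "C")]
)
print(hybrid_code(" NURSE ", index, fallback=lambda text: "Z"))
print(hybrid_code("pilot", index, fallback=lambda text: "Z"))
```

In practice the fallback would be a trained classifier. A constant function is used here only to keep the example self-contained.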