Database Lookups for Socio-Demographics in Surveys

Session Organisers: Dr Silke Schneider (GESIS - Leibniz Institute for the Social Sciences); Ms Stephanie Stuck (Max Planck Institute for Social Law and Social Policy, SHARE)
Time: Thursday 18th July, 09:00 - 10:30
Room: D19
In the age of digitization, rapid changes take place regarding what can be done in surveys to collect and code empirical data for administrative and research purposes. One development promising reduced office coding, less harmonization logistics and thus cost savings is to link survey questionnaires with large databases to allow the coding of survey variables during the interview.
Such procedures are most interesting for the measurement of respondents’ socio-demographics because they usually cannot be measured with scales but often require more or less detailed and elaborate categorical classification systems. Database lookups can either replace open questions that need to be post-coded, such as occupation, industry, country (e.g. of birth) or language (e.g. mother tongue or language most spoken at home), or long-list questions like education or fields of study.
What is less known is a) the effort required by survey organizations to implement such technologies in various kinds of surveys, and b) how interviewers and respondents in various survey modes experience such procedures. Are database lookups more or less burdensome than standard questions? Do they require more or less interview time? Is the quality of the resulting data comparable across the two approaches? Can the procedures possibly also be used for proxy measurements (e.g. parental occupation or education)? And finally c), what steps need to be taken to implement such procedures in surveys at a large scale?
This session invites presentations of past developments, ongoing research and promising designs that make use of database lookups for capturing socio-demographics in surveys and that evaluate the resulting data quality. Contributions regarding the use of databases for post-survey processing of socio-demographic information will also be considered in this session. While large cross-national surveys are, because of their requirement to produce cross-nationally comparable data, most interested in such developments, evidence from small-scale or national projects is also welcome.
Keywords: database, questionnaire, socio-demographics, measurement, coding, harmonization
Dr Silke Schneider (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
With this presentation, we would like to showcase two tools for measuring education in computer-assisted surveys that were developed in the CAMCES and SERISS projects. The first is the CAMCES tool, which measures respondents’ highest educational qualification; the second is a new tool that measures the field of education and training. Both tools are based on international multilingual databases, including international coding schemes for educational attainment and fields of education and training based on UNESCO’s standard classification of education. While the CAMCES tool provides context-sensitive response categories (i.e. corresponding to the country in which respondents received their education), fields of education and training are available in many languages and are offered to respondents in the language of the survey. In the first part of the presentation, we will briefly demonstrate these tools.
Furthermore, we will report results from a validation of the CAMCES tool for measuring the highest educational qualification in recent surveys of migrants. The CAMCES tool has been implemented in the migration and refugee samples of the German Socio-Economic Panel (SOEP) since 2015. Apart from choosing their educational qualification through this tool, migrants were additionally asked to report the years they spent in school and their educational attainment using non-context-sensitive items with generic descriptions of educational levels. We validate these different instruments for measuring migrants’ education using a construct validation that predicts their socio-economic status and their proficiency in the German language. In doing so, we want to assess the quality of the data resulting from the different measurement instruments.
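To illustrate what context-sensitive response categories mean in practice, the following is a minimal sketch in Python. It is not the CAMCES implementation: the data structure, the function names and the short list of qualifications are assumptions made for illustration, and the ISCED 2011 levels shown should be checked against the official mappings before any use.

```python
# Hypothetical excerpt of a qualifications database, keyed by the country
# in which the respondent was educated. Each entry pairs a country-specific
# qualification with an illustrative ISCED 2011 level.
QUALIFICATIONS = {
    "DE": [("Hauptschulabschluss", 2), ("Abitur", 3), ("Bachelor", 6)],
    "FR": [("Baccalauréat", 3), ("Licence", 6)],
}


def response_options(country):
    """Return the response categories shown for a given country of education."""
    return [name for name, _level in QUALIFICATIONS.get(country, [])]


def harmonised_level(country, qualification):
    """Map a selected country-specific qualification to its harmonised level."""
    for name, level in QUALIFICATIONS.get(country, []):
        if name == qualification:
            return level
    return None  # unknown country or qualification


# A respondent educated in France sees French categories only.
options_fr = response_options("FR")
# Different national qualifications can share one harmonised level.
level_de = harmonised_level("DE", "Abitur")
level_fr = harmonised_level("FR", "Baccalauréat")
```

The point of the design is that respondents choose among categories they recognise, while the analyst receives a code that is comparable across countries.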
Mrs Antje Rosebrock (MZES Mannheim) - Presenting Author
Mr Malte Schierholz (Institute for Employment Research)
Occupation is a crucial variable for social research, with applications in many different areas of research. It is used to capture individuals’ tasks and duties, to measure health risks caused by occupational hazards or to measure social prestige.
Traditionally, collecting information about respondents’ occupation(s) in social surveys is done by asking two or three open-ended questions about the occupational activities performed by the respondent, complemented by additional information from closed-ended questions. After the interview, the information is usually coded by professional coders according to official occupational classifications (in the case of Germany: ISCO-08 and KldB 2010). This coding process is not only expensive due to longer interview durations and manual coding efforts but also error-prone because responses are often ambiguous and incomplete, making them difficult to code according to the detailed official definitions.
Schierholz et al. (2017) have presented an alternative approach that allows occupation coding during the interview. A supervised learning algorithm is used to predict respondents’ job categories. In the interaction with the interviewer, a respondent can then choose his or her occupation from the task descriptions suggested by the algorithm. A prior version of the tool has been evaluated in an empirical context and performed reasonably well.
In our study, we test a new version of the tool in a face-to-face survey with about 1,000 respondents. Our research goals are, first, to analyze the quality of this new approach compared to manual coding and, second, to use behavior coding to analyze problems arising in the interaction between the interviewer and the respondent. Results and directions for future improvements are presented.
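The general idea of suggesting job categories from a verbatim answer can be sketched as follows. This is a simplified stand-in, not the algorithm used by Schierholz et al.: it uses a small naive Bayes word-count model, and the training answers and category labels are hypothetical placeholders rather than codes from an official classification.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: previously coded verbatim answers.
TRAINING = [
    ("nurse in a hospital ward", "nursing"),
    ("hospital nurse night shift", "nursing"),
    ("primary school teacher", "teaching"),
    ("teacher at a secondary school", "teaching"),
    ("software developer for web applications", "software"),
    ("develops software for banks", "software"),
]


def tokenize(text):
    """Lower-case the answer and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count word occurrences per category (the naive Bayes sufficient statistics)."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for text, category in examples:
        category_counts[category] += 1
        for token in tokenize(text):
            word_counts[category][token] += 1
            vocabulary.add(token)
    return word_counts, category_counts, vocabulary


def suggest(answer, model, k=2):
    """Return the k most probable categories for a verbatim answer."""
    word_counts, category_counts, vocabulary = model
    total = sum(category_counts.values())
    scores = {}
    for category, n in category_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[category].values()) + len(vocabulary)
        for token in tokenize(answer):
            if token in vocabulary:  # ignore words never seen in training
                # Laplace-smoothed log likelihood of the word in this category
                score += math.log((word_counts[category][token] + 1) / denom)
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


model = train(TRAINING)
suggestions = suggest("nurse in hospital", model)
```

In an interview setting, the ranked suggestions would be shown as task descriptions, and the respondent, together with the interviewer, would pick the one that fits best or reject all of them.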
Schierholz, M., Gensicke, M., Tschersich, N. and Kreuter, F. (2017) Occupation Coding during the Interview. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181, 379-407.
Ms Linda Pawlicki (University of Luzern)
Professor Caroline Roberts (University of Lausanne) - Presenting Author
Dr Michèle Ernst Stähli (FORS)
One of the many advantages of using web-based methods of data collection in survey research is the possibility to simultaneously collect and code empirical data, eliminating the need for manual data processing procedures. This advantage is particularly important in the measurement of respondents’ socio-demographic background, where pre-coded variables involving a large number of categories – such as country of birth, languages spoken, educational qualifications or occupation – can be automated by linking survey questionnaires with databases to allow the respondent (or interviewer) to select the appropriate response category directly. Such database look-ups are appealing for several reasons. For one, they can replace open-ended questions and potentially complex office-based coding procedures, and so potentially reduce data processing errors and survey costs. Respondents may also, arguably, be more accurate at coding themselves than interviewers or office-coders. Nevertheless, where highly complex coding procedures are necessary, coding accuracy depends not only on the ability of the coder to select the correct code, but also on the search algorithm integral to the system, which determines how easily respondents arrive at the correct code. In this paper we present the results of a study designed to compare different web-based database look-ups for occupation coding (employing different search algorithms) with standard office-coding procedures, to assess the extent of coding error associated with each. As well as involving cross-survey comparisons, we also compare modes of data collection to assess the relative advantages and disadvantages of self- vs. interviewer- vs. office-based coding using database look-ups, as well as mode effects on responses to the open-ended questions conventionally used in manual coding procedures for assigning ISCO codes.
We use a variety of indicators of coding quality to document the extent of processing error associated with each method and draw conclusions about the impact on resultant survey estimates.
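To make concrete how the choice of search algorithm can affect whether a respondent finds the right entry, here is a minimal sketch comparing three simple strategies. It does not reproduce the look-up systems compared in the study; the title list and function names are assumptions for illustration, and the fuzzy variant relies on `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical excerpt of an occupation title list.
TITLES = [
    "accountant",
    "bookkeeper",
    "carpenter",
    "primary school teacher",
    "software developer",
]


def prefix_search(query, titles):
    """Match only titles that start with the typed text."""
    q = query.strip().lower()
    return [t for t in titles if t.startswith(q)]


def substring_search(query, titles):
    """Match titles that contain the typed text anywhere."""
    q = query.strip().lower()
    return [t for t in titles if q in t]


def fuzzy_search(query, titles, n=3, cutoff=0.6):
    """Match titles that are similar to the typed text, tolerating typos."""
    q = query.strip().lower()
    return difflib.get_close_matches(q, titles, n=n, cutoff=cutoff)


# "teacher" is found by substring search but missed by prefix search.
prefix_hits = prefix_search("teacher", TITLES)
substring_hits = substring_search("teacher", TITLES)
# A misspelt query is only recovered by the fuzzy search.
fuzzy_hits = fuzzy_search("acountant", TITLES)
```

Even in this toy case, the same typed answer leads to a match under one strategy and to an empty result under another, which is one way coding error can differ between look-up systems.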
Ms Stephanie Stuck (SHARE) - Presenting Author
The Survey of Health, Ageing and Retirement in Europe (SHARE) developed and implemented an in-field coding tool, the so-called job coder, to deal with the well-known challenges of occupation coding in international contexts. The database lookup tool was first used in the fifth wave of SHARE to measure and code the occupations of respondents as well as their parents. A large multilingual database with thousands of job titles is used in the field to measure occupations and code them into the ISCO coding scheme. The tool was developed further within the SERISS project, a joint project of several European surveys that aimed at developing tools to harmonise measurement and coding in an international context. Based on previous experience from SHARE, both the tool and the database were improved and went into the field again in 28 countries in the seventh wave of SHARE. This talk focuses on the experience we gained using this tool, including implementation in an international context, translations, interviewer training and feedback and, last but not least, data outcomes.
Dr Kea Tijdens (University of Amsterdam) - Presenting Author
Occupation is a key variable in socio-economic research, used in a wide variety of studies. In surveys it is mostly asked using an open text format with office coding. Alternatively, web surveys and computer-based face-to-face surveys allow the use of an occupation database with coded occupational titles. Compared to office coding, a database is advantageous because all occupational titles are measured at the same level of detail, unidentifiable titles do not occur, and coding costs and time lags are avoided. Several single-country surveys have applied machine learning algorithms, which require a huge set of manually coded verbatim answers. For multi-country surveys, however, no such sets exist. From 2004 onwards, the author has gradually developed a multi-country occupational database, coded to ISCO-08 at 4 digits, for the multi-country WageIndicator web survey on work and wages. Thanks to the SERISS project (EU H2020 no 654221), the database could be extended to over 4,000 occupational titles, translated into 47 languages. The paper discusses the choices made during the design of the database, concerning issues related to (a) the measurement of occupational titles at 5 digit rather than 4 digit level, (b) the very long tail of the occupational distribution (any labour force can have tens of thousands of titles), (c) the country differences in occupational composition of the labour force and the implications for the source list of occupations, (d) the country-specific processes of upgrading and downgrading, (e) the absence of job descriptions and the empirical measurement of tasks in occupations, (f) titles with an overlapping meaning (e.g. bookkeeper and accountant), and (g) the fact that a higher educated labour force has more detailed occupational titles for higher skilled occupations, whereas a lower educated labour force has more detailed titles for lower skilled occupations.
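A multilingual database of coded occupational titles can be pictured as a set of entries, each with one code and one title per language. The following minimal sketch is not the WageIndicator or SERISS database: the entry layout, the function names, the translations and the codes are illustrative assumptions, and any code should be verified against the official ISCO-08 index.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    code: str     # illustrative 4-digit code, to be verified against ISCO-08
    titles: dict  # language code -> translated occupational title


# Hypothetical excerpt of a multilingual occupation database.
DATABASE = [
    Entry("2221", {"en": "nurse", "de": "Krankenpfleger", "fr": "infirmier"}),
    Entry("5120", {"en": "cook", "de": "Koch", "fr": "cuisinier"}),
    Entry("7115", {"en": "carpenter", "de": "Zimmermann", "fr": "charpentier"}),
]


def build_index(entries):
    """Index every (language, normalised title) pair by its code."""
    index = {}
    for entry in entries:
        for lang, title in entry.titles.items():
            index[(lang, title.casefold())] = entry.code
    return index


def lookup(index, lang, title) -> Optional[str]:
    """Return the code for a title selected in a given survey language."""
    return index.get((lang, title.strip().casefold()))


INDEX = build_index(DATABASE)
code_de = lookup(INDEX, "de", "Koch")
code_fr = lookup(INDEX, "fr", "Cuisinier ")
missing = lookup(INDEX, "en", "astronaut")
```

Because every translation of a title points to the same code, respondents in different countries who select equivalent titles receive identical codes, with no office coding step in between.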