Big Data and Survey Research
Convenor | Mr Yamil Nares (Institute for Social & Economic Research (ISER))
Coordinator 1 | Dr Tarek Al Baghal (Institute for Social & Economic Research (ISER)) |
Big Data involve massive amounts of high-dimensional and unstructured data that bring both new opportunities and new challenges to the data analyst. However, Big Data are often selective, incomplete, and erroneous, and new errors can be introduced downstream as the data are cleaned, integrated, transformed, and analyzed. We present a total error model for Big Data that enumerates many of the risks of false inferences in Big Data analysis. We also describe approaches for minimizing these risks, gleaned from more than a century of experience in analyzing and processing survey data.
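As a rough illustration (not the specific model presented in this talk), total-error frameworks of this kind typically decompose the mean squared error of an estimate into a squared bias term and a variance term, with the bias accumulating across stages such as selection, measurement, and processing:

```latex
% Hedged sketch: the standard mean-squared-error decomposition from the
% total survey error tradition. The split of the bias into illustrative
% components is an assumption; the talk's own enumeration may differ.
\[
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \underbrace{\mathrm{Bias}(\hat{\theta})^2}_{\text{systematic error}}
  + \underbrace{\mathrm{Var}(\hat{\theta})}_{\text{variable error}},
\qquad
\mathrm{Bias}(\hat{\theta}) \approx B_{\text{selection}} + B_{\text{measurement}} + B_{\text{processing}}.
\]
```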
With the worldwide spread of the internet in the 1990s, conducting web or e-mail surveys became popular in research. Although these surveys provide fast data collection and reduced costs, results may suffer from biases due to the survey mode. While a variety of studies on mode effects in household or individual surveys exists, much less is known about business surveys. Our results show that e-mail and web surveys reduce nonresponse and are more likely to be used by larger firms operating in technology-related business areas.
The extraction of response times from log-data is less straightforward when multiple questions are placed on a single screen and when the instrument design, like the one used for the computer-based context assessment of PISA 2015, offers degrees of freedom such as individual sequences, free navigation, and item review. Different time indicators (e.g., time for loading, reading, confirmation, …) will be defined theoretically, evaluated, and then used to empirically investigate similarities and differences in the response behaviour of about 15,000 teachers and about 120,000 students in the PISA 2015 Field Trial across 62 countries.
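As a minimal sketch of this kind of extraction (the PISA 2015 log format and the talk's indicator definitions may differ; event names and fields here are illustrative assumptions), per-screen time indicators can be derived from timestamped log events:

```python
# Hedged sketch: deriving per-screen time indicators from hypothetical
# timestamped log events. Event types ("load_start", "load_end",
# "first_interaction", "confirm") are illustrative, not the PISA schema.
from collections import defaultdict

def time_indicators(events):
    """Compute simple time indicators per screen.

    Each event is a dict like {"screen": "Q12", "type": "load_end", "t": 12.5},
    with t in seconds. Returns {screen: {indicator: seconds}}.
    """
    by_screen = defaultdict(list)
    for ev in sorted(events, key=lambda ev: ev["t"]):
        by_screen[ev["screen"]].append(ev)

    indicators = {}
    for screen, evs in by_screen.items():
        t = {ev["type"]: ev["t"] for ev in evs}  # last event of each type wins
        ind = {}
        if "load_start" in t and "load_end" in t:
            ind["loading"] = t["load_end"] - t["load_start"]
        if "load_end" in t and "first_interaction" in t:
            ind["reading"] = t["first_interaction"] - t["load_end"]
        if "first_interaction" in t and "confirm" in t:
            ind["answering"] = t["confirm"] - t["first_interaction"]
        indicators[screen] = ind
    return indicators
```

With free navigation and item review, a screen may be visited several times; this sketch keeps only the last event of each type, which is one of several defensible choices the indicator definitions would need to make explicit.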
Statistical data are often disseminated under different standards from a web platform (websites, web services). Normally, the data are stored in a relational database or data warehouse. Each statistical data source has its own relational database schema and DBMS (Oracle, MySQL, Postgres, MSSQL). Each dataset (data and metadata) is disseminated in many ways (RESTful, SOAP) and in many formats (XML, JSON, CSV). Statistical data and metadata can be represented as an object consisting of key/value pairs.
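As a minimal sketch of this key/value representation (all field names are illustrative assumptions, not a published standard), a dataset and its metadata can be modelled as one object independent of the source DBMS:

```python
# Hedged sketch: one dataset (data plus metadata) as a key/value object,
# decoupled from the underlying relational schema. Field names are
# illustrative only.
import json

dataset = {
    "metadata": {
        "id": "POP_BY_REGION",              # hypothetical dataset identifier
        "source_dbms": "Postgres",          # could equally be Oracle, MySQL, MSSQL
        "formats": ["XML", "JSON", "CSV"],  # formats offered for dissemination
        "access": ["RESTful", "SOAP"],      # dissemination channels
    },
    "data": [
        # each observation as a key/value record, flattened from the
        # source relational schema
        {"region": "North", "year": 2015, "population": 1234567},
        {"region": "South", "year": 2015, "population": 2345678},
    ],
}

print(json.dumps(dataset, indent=2))  # one JSON serialization of the object
```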
This presentation will discuss how micro-level data linking survey responses to social media can be created, including issues of data linkage consent, ethics, collecting the social media data, and methods for converting these data into meaningful measures. The necessary text analytic methods can capture and create a number of potentially useful variables, including improved proxy variables. These measures can then be used in methods for nonresponse adjustment. The presentation will discuss how to evaluate these measures and how they fare in nonresponse correction. As an example, measures from aggregated Twitter data are compared and correlated with Understanding Society data.
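As a minimal sketch of one such adjustment (a response-propensity weighting approach assumed here for illustration; the presentation's actual methods, variables, and data may differ), social-media-derived proxy variables can feed a response-propensity model whose inverse predicted probabilities serve as nonresponse weights:

```python
# Hedged sketch: inverse-propensity nonresponse adjustment using hypothetical
# auxiliary measures derived from linked social media text. Variable names,
# the simulated data, and the weighting scheme are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical text-derived measures (e.g., topic or sentiment scores)
# available for both respondents and nonrespondents via the linkage.
X = rng.normal(size=(n, 3))
responded = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Model the probability of responding given the auxiliary measures.
model = LogisticRegression().fit(X, responded)
p_respond = model.predict_proba(X)[:, 1]

# Inverse-propensity weights for respondents; nonrespondents get weight 0.
weights = np.where(responded == 1, 1.0 / p_respond, 0.0)
weights /= weights[responded == 1].mean()  # normalize among respondents

print("weight range among respondents:",
      weights[responded == 1].min(), weights[responded == 1].max())
```

Evaluating such measures, as the presentation proposes, would then amount to checking whether weighting on them moves respondent estimates toward known benchmarks.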