How surveys and big data can work together

Chair: Dr Mario Callegaro (Google)
Coordinator 1: Dr Yongwei Yang (Google)
This paper considers the strengths and weaknesses of Big Data, survey data, and “unified” data (i.e., the combination of survey data with Big Data) through the lens of a total quality framework. This perspective allows data producers to compare, more objectively and comprehensively, the whole range of quality dimensions and error sources across data sources and so identify their relative strengths and weaknesses. Researchers can then use this information to identify methods for combining the different data sources in ways that leverage their positive and diminish their negative attributes. The end product is a unified data set with greater total quality than any of its source components. The paper presents results from a survey conducted at the recent International Total Survey Error Workshop (ITSEW 2016, held in Sydney, Australia), which sought to quantify the quality aspects of Big Data and survey data across ten quality dimensions. The total quality framework is then applied to a unified data set developed by the authors for a key variable in the 2015 U.S. Residential Energy Consumption Survey (RECS 2015). In this study, survey data on dwelling unit area were combined with data from an online real estate database to reduce measurement error, nonresponse, sampling error, and possibly other errors. The paper reports the results of an investigation into whether the unified data estimates were of higher quality than estimates derived from the individual data sources.
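The kind of unification described above can be illustrated with a toy composite estimator. The weighting scheme, variable names, and error variances below are hypothetical, chosen only to show the idea of combining a survey report with an auxiliary big data value; they are not taken from the RECS 2015 study.

```python
def unified_estimate(survey_value, bigdata_value,
                     survey_error_var, bigdata_error_var):
    """Inverse-variance weighted combination of two measurements of
    the same quantity: the noisier source gets the smaller weight."""
    w_survey = 1.0 / survey_error_var
    w_bigdata = 1.0 / bigdata_error_var
    return (w_survey * survey_value + w_bigdata * bigdata_value) / (w_survey + w_bigdata)

# Illustrative example: survey-reported dwelling floor area (sq ft)
# combined with the area listed in an online real estate database.
area = unified_estimate(survey_value=1800.0, bigdata_value=2000.0,
                        survey_error_var=400.0, bigdata_error_var=100.0)
print(round(area, 1))  # 1960.0 -- pulled toward the lower-variance source
```

Under these assumed variances the unified value sits closer to the real estate figure, mirroring the paper's goal of a combined estimate with smaller total error than either source alone.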
In the past couple of years the demand for big data in the social sciences has increased tremendously. In survey operations especially, big data are considered a valuable resource for mitigating errors in the survey process and in empirical analyses, and surveys can in turn reveal the limits of big data. On the one hand, surveys allow us to assess where big data show coverage problems. For example, American Community Survey information on internet, smartphone, and computer usage identifies spatial areas where a specific part of the population does not have internet access; these populations are most likely not included in, for example, media data or tracking data. On the other hand, studies show that big data can be used to improve survey sampling frames, to assess nonresponse error, and to support nonresponse adjustment or responsive designs. For example, open data such as reported crime, access to public recreation facilities (parks, pools, etc.), online state administrative records for homeowners, and wifi hotspots can be used to collect more information about nonrespondents. This paper discusses the potential of combining survey and big data along the Total Survey Error and Total Big Data Error frameworks by showing practical applications in the field of survey research.
The traditional approach of designing a survey that collects all the information needed to produce a deliverable, be it a publication, an indicator for national purposes, or a deliverable fulfilling our national obligations to Eurostat or other institutions, is no longer sustainable. Life has changed in recent years, not only in the perception of how the use of administrative data could lessen the burden on respondents and reduce survey costs, but also because information gathered through sensors and other automated devices can constitute a useful resource.
Many questions arise when we try to use data, be it administrative or sensor generated, to produce official statistics. Most arise because the data were not collected for statistical purposes, so methodologically their use can be complex: they may not meet statistical standards on concepts or definitions. There are also selectivity problems, and usually no guarantee of the source's continuity and stability.
When a Big Data holder decides to change definitions, collect different data, or stop data collection entirely, statistical offices usually have no leverage to prevent the loss of that data. For this reason, our approach to Big Data that we do not collect ourselves, through web scraping for example, has been cautious and gradual.
After the 2011 census operation, a national address database was built with the results of the buildings census conducted in parallel with the population census. Since then this database has been enriched with administrative data from the city councils and with several surveys targeted at construction promoters and similar actors. We are now considering enriching it with a big data source on electricity consumption. Once a threshold for electricity consumption is established, this information would tell us whether a dwelling is a primary or secondary residence.
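The classification rule described above amounts to a simple threshold test. The following sketch assumes an annual-consumption cutoff; the threshold value and function names are purely illustrative, since the paper notes the actual threshold is still to be established.

```python
PRIMARY_THRESHOLD_KWH = 1200.0  # hypothetical annual consumption cutoff

def residence_type(annual_kwh, threshold=PRIMARY_THRESHOLD_KWH):
    """Classify a dwelling as a primary or secondary residence from
    its annual electricity consumption (e.g., smart meter data)."""
    return "primary" if annual_kwh >= threshold else "secondary"

print(residence_type(3500.0))  # primary
print(residence_type(300.0))   # secondary
```

In production, the consumption figures would come from the electricity provider's smart meter data linked to the national address database, subject to the quality checks discussed below.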
For the aforementioned reasons we are looking at only a very small portion of the big data available, since smart meters can generate a huge amount of data but the source can change without much notice. Making an investment in an unstable source would not be a sound strategic decision at this point, so we are focusing on the billing address, which is the main interest of our Big Data source holder, and the amount of electricity to be charged.
Even in this case, of course, we try to collaborate with the data holders to introduce some safeguards while we develop the new methodologies that will allow us to combine the sources. Different quality checks must be applied to the data extracted from Big Data sources, and the combination of these quality checks and new methodologies is addressed in the specific case of the national address database that we present in this paper.
We present first results from a major online panel tracking study that aims to investigate media exposure and opinion formation in an age of information overload. The overarching project asks whether online sources and social media platforms exacerbate segregation, polarization, and inequality in political knowledge and behavior, and what role weak ties play in moderating citizens’ information diet. To that end, we study how offline events are translated into media coverage, how people decide to consume or avoid that coverage, and how these decisions affect attitudes and behavior. We combine passive metering technology to capture the online media consumption of a representative sample of individuals in Germany and the United States, draw on machine learning and natural language processing methods to estimate the topic and ideological slant of each piece of content consumed, and in parallel survey the panelists directly at regular intervals to monitor changes in issue attention, opinion, and political knowledge.