How surveys and big data can work together 2
Chair | Dr Mario Callegaro (Google)
Coordinator 1 | Dr Yongwei Yang (Google)
We show how to combine a US-wide survey with Google Trends data to yield out-of-sample predictions for finer-grained geographies such as cities, market areas, and states. This post-stratification technique yields predictions with uniform, small confidence intervals.
First, survey responses are collected, each tagged with the geography of the respondent. These responses are combined with Google Trends data, indexed by geography and by search vertical, and variable selection is then performed to determine which subset of verticals best models the survey responses. Variable selection takes advantage of geographic differences in Google Web Search; e.g., the number of searches in each search category may differ between university towns and towns with strong vehicle manufacturing. Typically, a handful of verticals suffices to model the survey responses.
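The selection step above can be sketched in a few lines. The data here are purely illustrative (eight hypothetical verticals across fifty geos, with only two verticals actually driving the per-geo survey response), and a simple correlation-based filter followed by ordinary least squares stands in for whatever selection procedure is actually used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a Google-Trends-style index for 8 search verticals
# in 50 geos, plus the mean survey response per geo (all values invented).
n_geos, n_verticals = 50, 8
trends = rng.normal(size=(n_geos, n_verticals))
# Suppose only verticals 1 and 4 actually drive the survey response.
response = 2.0 * trends[:, 1] - 1.5 * trends[:, 4] + rng.normal(0.0, 0.1, size=n_geos)

# Rank verticals by absolute correlation with the response and keep the
# "handful" with the strongest signal.
corr = np.array([abs(np.corrcoef(trends[:, j], response)[0, 1])
                 for j in range(n_verticals)])
selected = np.argsort(corr)[::-1][:2]

# Fit an ordinary least-squares model on the selected verticals only;
# this model can then predict responses for geos the survey never reached.
X = np.column_stack([np.ones(n_geos), trends[:, selected]])
coef, *_ = np.linalg.lstsq(X, response, rcond=None)
print(sorted(selected.tolist()))
```

Once fitted on surveyed geos, the same model applied to the Trends index of an unsurveyed geo yields the out-of-sample predictions described below.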
The selected verticals themselves provide insight. For example, those who supported Barack Obama before the 2012 U.S. presidential election were interested in books, basketball, student loans, and hip-hop, but showed negative interest in Christianity and coupons.
Using Google Trends data at the geographic level, out-of-sample predictions of responses can be computed for fine-grained geographies such as cities, market areas, and states. Obtaining this level of detail from a survey alone would greatly increase its cost. We call this technique "survey amplification".
Marketers can identify target audiences by amplifying survey results to yield, for each geo, the fraction of inhabitants likely to give the desired survey answer. Combining these predictions with the cost of advertising identifies cost-effective target areas, for the price of a survey costing a few hundred or a few thousand dollars.
If market sales data broken down by geo are available, amplification can model sales geographically. Underperforming markets, i.e., geos with lower actual sales than predicted, may need special treatment to improve sales.
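The comparison of predicted to actual sales is straightforward to sketch. The geo names, sales figures, and the 15% shortfall threshold below are all invented for illustration:

```python
# Hypothetical per-geo sales: predicted from amplified survey data vs. actual.
predicted = {"Austin": 120.0, "Boise": 80.0, "Cleveland": 100.0, "Denver": 95.0}
actual    = {"Austin": 118.0, "Boise": 55.0, "Cleveland": 101.0, "Denver": 70.0}

# Flag geos whose actual sales fall more than 15% below the model's prediction
# as underperformers that may warrant special treatment.
shortfall_threshold = 0.15
underperformers = sorted(
    geo for geo in predicted
    if (predicted[geo] - actual[geo]) / predicted[geo] > shortfall_threshold
)
print(underperformers)  # ['Boise', 'Denver']
```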
Amplification, i.e., obtaining predictions, vertical models, identification of underperformers, and extrapolations, preserves user privacy: survey results need only be annotated with the associated geographies, and Google Trends data aggregates trillions of web searches. It is also inexpensive, requiring just a few hundred or a few thousand dollars to obtain the survey results.
National Statistical Institutes (NSIs) often use large datasets to estimate population tables on many different aspects of society. One way to create these rich datasets as efficiently and cost-effectively as possible is to utilize already available administrative data. When more information is required than is already available, administrative data can be supplemented with survey data. Caution is advised, as both surveys and administrative data can contain classification errors.
Therefore, we developed a method combining multiple imputation (MI) and latent class (LC) analysis (MILC) to estimate the number of classification errors in combined datasets and to simultaneously impute a new variable that takes the uncertainty caused by these classification errors into account. With this method it is possible to obtain estimates that are consistent and that account for impossible combinations, such as pregnant males. Rules forbidding such impossible combinations are often referred to as edit rules. Taking edit rules into account is especially useful within official statistics, since cells in cross tables that represent combinations of scores that are impossible in practice should contain zero observations.
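A minimal sketch of how an edit rule can be enforced during an imputation draw, using the pregnant-male example: the two-class setup and the posterior probabilities below are invented for illustration and are not the authors' actual LC model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior class-membership probabilities for one record,
# as an LC model might produce: classes ("not pregnant", "pregnant").
posterior = np.array([0.7, 0.3])
sex = "male"

# Edit rule: the combination (male, pregnant) is impossible, so its
# probability is set to zero and the remainder renormalised before imputing.
if sex == "male":
    posterior = posterior * np.array([1.0, 0.0])
posterior = posterior / posterior.sum()

# Draw one multiple-imputation realisation of the latent variable;
# the edit rule guarantees the impossible cell stays empty.
imputed = rng.choice(["not pregnant", "pregnant"], p=posterior)
print(posterior.tolist(), imputed)
```

Repeating the draw across records and across multiple imputed datasets, cross tables built from the imputed variable contain zero observations in every cell an edit rule forbids.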
However, the MILC method only obtains consistent estimates and only takes edit rules into account for variables that were included in the LC model. For variables not included in the LC model, estimates may be inconsistent and impossible combinations may occur. We now extend the MILC method to incorporate stepwise LC modelling, which makes it possible to estimate relations with, or apply edit rules to, covariates that were not taken into account by the initial LC model.
In this paper, we illustrate how the three-step approach is incorporated into the MILC method and discuss the methodology of both the three-step approach and the MILC method. We perform a simulation study to investigate the performance of the MILC method in combination with the three-step approach, and we apply the method in practice.
Online survey platforms have made it possible for anyone to quickly and cheaply create and send their own surveys and analyze the resulting data. Every day, SurveyMonkey users around the world send out 24,000 surveys, and 3 million responses are collected on the platform. This research examines the characteristics of these user-created surveys and the experience of respondents, to glean insights into how online surveys can be improved.
We will examine current trends in surveys from the perspective of both the survey creator and the survey taker. Using our database of surveys, we will produce estimates of average completion rate and completion time and the distribution of survey topics, and examine how these differ by survey design (length, question type). We will further segment this analysis by desktop versus mobile survey takers.
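The desktop-versus-mobile segmentation described above can be sketched as a simple grouped aggregation. The per-survey records below are invented for illustration; the actual analysis would run over the platform's survey database:

```python
from collections import defaultdict

# Hypothetical per-response records: device of the taker, whether the
# survey was completed, and time spent in seconds (all values invented).
records = [
    {"device": "desktop", "completed": True,  "seconds": 240},
    {"device": "desktop", "completed": True,  "seconds": 300},
    {"device": "desktop", "completed": True,  "seconds": 270},
    {"device": "desktop", "completed": False, "seconds": 60},
    {"device": "mobile",  "completed": True,  "seconds": 420},
    {"device": "mobile",  "completed": False, "seconds": 45},
]

# Group completion outcomes by device, then compute the completion
# rate within each segment.
by_device = defaultdict(list)
for r in records:
    by_device[r["device"]].append(r["completed"])

rates = {d: sum(v) / len(v) for d, v in sorted(by_device.items())}
print(rates)  # {'desktop': 0.75, 'mobile': 0.5}
```

The same grouping key can be swapped for survey length or question type to produce the other design-based breakdowns.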