Linking Big Data and Surveys in Practice: Solutions for Respondents' Privacy Protection |
Chair | Professor Rainer Schnell (City University London) |
The increasing availability of geographically referenced auxiliary data that can be added to an enhanced survey data set implies additional re-identification risks for respondents. To protect the privacy of survey respondents, different methods of guarding against re-identification attacks (geomasking) have been proposed in the literature. Most methods rely either on aggregation or on the addition of random noise. Both approaches reduce the usefulness of the geo-information. Therefore, new geomasking methods are needed for survey research using small-area geo-referenced data. A new idea for geomasking was originally presented by Farrow (2014) and further studied by Farrow & Schnell (2016).
This method allows the computation of distances without revealing the location of the respondents. The method is based on randomly labeled grid points on a map. The distance is approximated by the number of common grid points within a defined circle centered at the location of interest. We report on extensive simulations of the method for survey-based studies. Furthermore, we will demonstrate how this technique might be used for including geographical information in record linkage tasks.
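The idea of approximating a distance from shared grid-point labels can be illustrated with a minimal Python sketch. It is not the implementation of Farrow (2014) or Farrow & Schnell (2016); the function names, the keyed-hash labelling via HMAC, the square grid, and the inversion of the circle-overlap area by bisection are all assumptions made for illustration only.

```python
import hashlib
import hmac
import math


def grid_labels(x, y, radius, spacing, secret):
    """Return the pseudo-random labels of all grid points within `radius` of (x, y).

    Assumption: labels come from a keyed hash of the grid indices, so a party
    without `secret` cannot map a label back to a map position.
    """
    labels = set()
    i_min = math.floor((x - radius) / spacing)
    i_max = math.ceil((x + radius) / spacing)
    j_min = math.floor((y - radius) / spacing)
    j_max = math.ceil((y + radius) / spacing)
    for i in range(i_min, i_max + 1):
        for j in range(j_min, j_max + 1):
            gx, gy = i * spacing, j * spacing
            if (gx - x) ** 2 + (gy - y) ** 2 <= radius ** 2:
                mac = hmac.new(secret.encode(), f"{i}:{j}".encode(), hashlib.sha256)
                labels.add(mac.hexdigest()[:16])
    return labels


def lens_area(d, radius):
    """Overlap area of two circles with equal `radius` whose centres are `d` apart."""
    if d >= 2 * radius:
        return 0.0
    return (2 * radius ** 2 * math.acos(d / (2 * radius))
            - (d / 2) * math.sqrt(4 * radius ** 2 - d ** 2))


def estimate_distance(labels_a, labels_b, radius, spacing):
    """Estimate the distance between two locations from their common labels only."""
    # Each common grid point stands for roughly spacing**2 units of overlap area.
    target = len(labels_a & labels_b) * spacing ** 2
    lo, hi = 0.0, 2 * radius
    for _ in range(60):  # bisection: the overlap area decreases as d grows
        mid = (lo + hi) / 2
        if lens_area(mid, radius) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    a = grid_labels(1000.0, 1000.0, 500.0, 10.0, "demo-key")
    b = grid_labels(1300.0, 1400.0, 500.0, 10.0, "demo-key")  # true distance: 500
    print(estimate_distance(a, b, 500.0, 10.0))
```

In this sketch only the label sets are exchanged, not the coordinates. Locations more than two radii apart share no labels, so their distance cannot be distinguished beyond that bound.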
Enriching survey data with data from other sources, such as administrative data, process-generated data, or big data, is becoming increasingly popular in the social sciences,
medical research, criminology and other quantitative research fields.
If the data entities are natural persons, linking different data sources is often restricted to encrypted personal identifiers when no unique personal linkage key (PID) is available. In this scenario, techniques from the very active research field of Privacy-preserving Record Linkage (PPRL) have to be used.
Recently, Bloom Filters have become increasingly popular for PPRL applications because their similarity-preserving properties enable the use of similarity threshold-based linkage techniques. Since these inherent properties will eventually lead to security vulnerabilities,
the cryptographic properties of Bloom Filters are another central point. This contribution will give an overview of recent developments in cryptographic attacks, of ways to prevent such attacks by making the encryptions more resilient,
and of solutions for very large databases using Bloom Filter encryptions with Multibit trees.
Using large-scale data, state-of-the-art Bloom Filter encryption variants are tested against other best-practice methods,
namely ALCs, in terms of linkage quality and scalability (linking speed versus file size).
Finally, state-of-the-art best-practice guidelines for Bloom Filter-based PPRL will be given.
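The similarity-preserving property of Bloom Filter encodings mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not one of the variants evaluated in the contribution: the bigram splitting with padding, the HMAC-based hash functions, the filter length of 1000, the 20 hash functions, and all function names are assumptions chosen for the example.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a padded, lower-cased string into its set of 2-grams."""
    padded = f"_{name.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, secret, length=1000, num_hashes=20):
    """Return the set of bit positions set to 1 in the Bloom Filter of `name`.

    Assumption: each bigram is mapped to `num_hashes` positions with a keyed
    hash (HMAC-SHA256), so the encoding depends on a shared secret.
    """
    positions = set()
    for gram in bigrams(name):
        for k in range(num_hashes):
            mac = hmac.new(secret.encode(), f"{k}:{gram}".encode(), hashlib.sha256)
            positions.add(int.from_bytes(mac.digest()[:8], "big") % length)
    return positions


def dice(bits_a, bits_b):
    """Dice coefficient of two Bloom Filters given as sets of set-bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


if __name__ == "__main__":
    key = "shared-secret"  # hypothetical key agreed on by the data owners
    smith = bloom_encode("Smith", key)
    smyth = bloom_encode("Smyth", key)
    jones = bloom_encode("Jones", key)
    print(dice(smith, smyth), dice(smith, jones))
```

Similar names share most of their bigrams and therefore most of their set bits, so a similarity threshold on the Dice coefficient can separate likely matches from non-matches without comparing clear-text identifiers. The same regularity is what cryptographic attacks on such encodings exploit.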