ESRA logo




Tuesday 18th July, 14:00 - 15:30 Room: F2 102


Linking Big Data and Surveys in Practice: Solutions for Respondents' Privacy Protection

Chair Professor Rainer Schnell (City University London)

Session Details

Linking survey data to big data and administrative data is an increasingly popular research strategy in official statistics as well as in the social and medical sciences. In research settings using microdata, the privacy of respondents is of utmost importance. Examples are geocoded services or respondent addresses, where the actual locations have to be protected by statistical measures or mathematical transformations (geomasking). Similar problems arise from the availability of mobility tracks produced by cars, phones or laptops. The problems are even more severe with biomarkers or genetic data. Finally, linking different databases and/or surveys is possible in practice only if the privacy of the respondents can be protected. In many cases this requires the use of Privacy Preserving Record Linkage techniques.

During the last 10 years, the privacy problems created by the increasing availability of big data and survey data have given rise to many different mathematical and statistical techniques for preserving the privacy of respondents. The goal of the session is to present these techniques to a broader audience. Tutorials as well as recent developments are welcome.
We invite presentations on:

1. Analyzing individual geographical data without revealing locations
2. Protecting anonymity in the analysis of mobility profiles
3. Privacy Preserving Record Linkage
4. Privacy of biomarkers and genetic data.

Research on informed consent will not be covered in this session. Statistical disclosure control is also considered a topic for other sessions.

Paper Details

1. Putting people on the map without revealing their location: Privacy preserving locational distance computations
Professor Rainer Schnell (Universität Duisburg-Essen)
Mr Jonas Klingwort (Universität Duisburg-Essen)

The increasing availability of geographically referenced auxiliary data which can be included in an enhanced survey data set implies additional re-identification risks for the respondents. To protect the privacy of survey respondents, different methods to guard against re-identification attacks have been proposed in the literature (geomasking). Most methods use either aggregation or the addition of random noise. Both approaches reduce the usefulness of the geo-information. Therefore, new methods for geomasking are needed for survey research using small-area geo-referenced data. A new idea for geomasking was originally presented by Farrow (2014) and further studied by Farrow & Schnell (2016).

This method allows the computation of distances without revealing the location of the respondents. The method is based on randomly labeled grid points on a map. The distance between two locations is approximated by the number of grid points common to circles of a defined radius centered at each location of interest. We report on extensive simulations of the method for survey-based studies. Furthermore, we will demonstrate how this technique might be used for including geographical information in record linkage tasks.
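
The grid-point idea can be illustrated with a minimal Python sketch. It is not the implementation of Farrow & Schnell; it assumes a unit-spaced square grid, and the function names (build_labels, encode, estimate_distance) as well as the step of inverting the circle-overlap area to recover a distance are illustrative choices made here. Each location is represented only by the random labels of the grid points inside its circle, so two parties can compare label sets without exchanging coordinates.

```python
import math
import random


def build_labels(width, height, seed=0):
    """Give every integer grid point a random, unique label (hypothetical setup)."""
    rng = random.Random(seed)
    points = [(x, y) for x in range(width) for y in range(height)]
    labels = rng.sample(range(10 * len(points)), len(points))
    return dict(zip(points, labels))


def encode(location, radius, labels):
    """Return the labels of all grid points within `radius` of `location`."""
    px, py = location
    found = set()
    for x in range(math.floor(px - radius), math.ceil(px + radius) + 1):
        for y in range(math.floor(py - radius), math.ceil(py + radius) + 1):
            if (x, y) in labels and (x - px) ** 2 + (y - py) ** 2 <= radius ** 2:
                found.add(labels[(x, y)])
    return found


def lens_area(d, r):
    """Area shared by two circles of radius r whose centers are d apart."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - 0.5 * d * math.sqrt(4 * r * r - d * d)


def estimate_distance(set_a, set_b, radius):
    """Estimate the distance from the number of shared labels.

    With unit grid spacing, the shared count approximates the overlap area,
    which decreases monotonically in the distance, so bisection inverts it.
    A result of 2 * radius means "at least 2 * radius apart".
    """
    common = len(set_a & set_b)
    lo, hi = 0.0, 2.0 * radius
    for _ in range(60):
        mid = (lo + hi) / 2
        if lens_area(mid, radius) > common:
            lo = mid  # overlap still too large, so the true distance is larger
        else:
            hi = mid
    return (lo + hi) / 2


labels = build_labels(150, 150, seed=1)
a = encode((60.3, 70.7), 20, labels)
b = encode((75.1, 78.2), 20, labels)
print(estimate_distance(a, b, 20), math.hypot(75.1 - 60.3, 78.2 - 70.7))
```

In this sketch the radius trades accuracy against range: a larger circle covers more grid points and gives a smoother estimate, but any pair of locations farther apart than twice the radius shares no labels and can only be reported as "far apart".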


2. An Overview of State of the Art Bloom Filter-based Privacy-preserving Record Linkage for Very Large Databases
Professor Rainer Schnell (University of Duisburg-Essen)
Mr Christian Borgs (University of Duisburg-Essen)

Enriching survey data with data from other sources, such as administrative data, process-generated data or big data, is becoming increasingly popular in the social sciences, medical research, criminology and other quantitative research fields. If the data entities are natural persons, linking different data sources is often restricted to encrypted personal identifiers when no unique personal linkage key (PID) is available. In this scenario, techniques of the very active research field of Privacy-preserving Record Linkage (PPRL) have to be used.

Recently, Bloom Filters for PPRL applications have become increasingly popular, due to their similarity-preserving properties, which enable the use of similarity threshold-based linkage techniques. Since these inherent properties also lead to security vulnerabilities, the cryptographic properties of Bloom Filters are another central point. This contribution will give an overview of recent developments in cryptographic attacks, of ways to prevent them by making the encryptions more resilient, and of solutions for very large databases using Bloom Filter encryptions with Multibit trees.
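
The similarity-preserving property mentioned above can be illustrated with a minimal Python sketch of a basic bigram Bloom Filter encoding. It is not one of the hardened variants discussed in the talk; the filter length m, the number of hash functions k, the padding character, the keyed HMAC-SHA256 hashing scheme and all function names are assumptions made for illustration only.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a padded, lower-cased name into its set of 2-grams."""
    padded = f"_{name.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, key, m=1000, k=10):
    """Return the set of bit positions set to 1 in a Bloom Filter of length m.

    Every bigram is hashed k times with a secret key (assumed to be shared
    by the data owners), so similar names set many of the same bits.
    """
    bits = set()
    for gram in bigrams(name):
        for i in range(k):
            digest = hmac.new(key, f"{i}:{gram}".encode("utf-8"), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits


def dice(a, b):
    """Dice coefficient of two Bloom Filters given as sets of 1-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key, for illustration only
smith = bloom_encode("Smith", key)
print(dice(smith, bloom_encode("Smyth", key)))  # similar names: high similarity
print(dice(smith, bloom_encode("Jones", key)))  # unrelated names: low similarity
```

A linkage unit that sees only the bit patterns can accept a pair when its Dice coefficient exceeds a chosen threshold. The same regularity that makes this possible, identical bigrams always setting identical bits, is what the cryptographic attacks mentioned above exploit.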

Using large-scale data, state-of-the-art Bloom Filter encryption variants are tested against other best-practice methods, namely ALCs, in terms of linkage quality and scalability (linking speed versus file size). Finally, state-of-the-art best-practice guidelines for Bloom Filter-based PPRL will be given.