Technical Problems and Solutions for Record Linkage and Big Data 1 |
|
Convenor | Dr Manfred Antoni (Institute for Employment Research (IAB)) |
Coordinator 1 | Mr Stefan Bender (Institute for Employment Research (IAB)) |
Coordinator 2 | Professor Rainer Schnell (University of Duisburg-Essen) |
A privacy-preserving method of encoding location is presented that allows encoded records to be compared to determine the distance between locations to a desired level of accuracy, without the need to encode explicit coordinates. A discussion of the trade-off between encoding size, desired accuracy and maximum calculable distance is also presented.
The approach is suitable for allowing calculations on geospatial information, e.g. address information, where individual locations must not be readily identifiable for privacy reasons but where records may need to be compared to obtain their distance from one another or from other features.
Privacy-preserving record linkage (PPRL) is an academic field dedicated to the study of techniques for linking surveys and/or administrative databases without the use of unique personal identifiers. During the last decade, a number of different PPRL techniques have been suggested, and a few of them are actually in use for large-scale surveys. The presentation will explain the basic approaches and their advantages and disadvantages with respect to performance and cryptographic properties. Based on recent research, recommendations for the practical implementation of PPRL for large data sets with millions of records and missing identifiers will be given.
I will describe work conducted by the American Institutes for Research to link the NSF's Survey of Earned Doctorates (SED) to UMETRICS data, an administrative data set that contains payments to employees on federal grants for participating universities. Data were linked using a standard Fellegi-Sunter approach. I will describe our process for preparing and linking the data, including our success using Python, MySQL, and an in-house Java implementation of the Fellegi-Sunter algorithm.
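For readers unfamiliar with the Fellegi-Sunter approach named above, the core scoring step can be sketched in a few lines of Python. This is a minimal illustration, not the in-house implementation described in the abstract: the field names, the m- and u-probabilities, and the two thresholds are all hypothetical values chosen for the example, and field comparison is simplified to exact equality.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
M_PROBS = {"last_name": 0.95, "first_name": 0.90, "birth_year": 0.85}
U_PROBS = {"last_name": 0.01, "first_name": 0.05, "birth_year": 0.02}


def compare(rec_a, rec_b):
    """Return an agreement pattern: True where the two records agree exactly."""
    return {field: rec_a[field] == rec_b[field] for field in M_PROBS}


def match_weight(agreement):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Assign a pair to one of three classes using hypothetical thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Made-up records for illustration.
a = {"last_name": "Meyer", "first_name": "Anna", "birth_year": 1980}
b = {"last_name": "Meyer", "first_name": "Anne", "birth_year": 1981}
print(classify(match_weight(compare(a, a))))
print(classify(match_weight(compare(a, b))))
```

In practice the m- and u-probabilities are estimated from the data rather than fixed by hand, and pairs falling between the two thresholds are typically sent for clerical review.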