Privacy Preserving Record-Linkage Techniques
Convenor: Professor Rainer Schnell (University of Duisburg-Essen)
Coordinator 1: Mr Stefan Bender (German Record Linkage Center)
Linking survey data to administrative data is becoming more important every day, particularly in official statistics. Most survey researchers concentrate on respondents' willingness to give permission to link their responses to databases. In many countries, such linkages are possible under severe legal constraints even when respondents have not given permission. Under such conditions, the identifying information of respondents must be protected by special technical means, for example fault-tolerant encrypted identifiers. Such measures are also necessary when respondent data must be linked to databases but no Personal Identification Number is available or allowed. Since millions of records must be linked to a survey sample (usually between 1,000 and 30,000 respondents), the correct linking of protected identifiers is of utmost importance for this kind of research. Examples of such surveys are large-scale medical projects, census-based surveys and large-scale panels, for example in France, Germany and Switzerland. We would like to discuss the different ongoing projects with colleagues from survey organizations across Europe.
In many fields of research, longitudinal micro data are desirable, so individuals must be traced over time. For example, in epidemiological research, a national cohort may be tracked lifelong in the databases of health care providers. In criminological research, the identity of offenders has to be known in order to compute the individual risk of recidivism. Linking databases over time therefore requires a personal identification number. If no unique personal identification number is available, the linkage of personal data of the same individual across time must be based on identifying characteristics such as names, dates of birth, or addresses. Since this raises privacy concerns, methods of privacy-preserving identity management in longitudinal research are needed.
So far, quite simple algorithms for the generation of pseudonyms have been in common use. However, these algorithms yield non-matching pseudonyms when errors or changes in the underlying information occur. Recently, the use of a special data structure (Bloom filters) for the cryptographic encoding of personal identifiers has been suggested. We examine the performance of different identification keys and a variant of the Bloom filter encoding using real data from a large medical database. The promising results of this comparison demonstrate the usability of Bloom filter based keys for linking large national databases such as registers, national panel studies and censuses with strongly encrypted identifiers only.
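The contrast between a simple pseudonym and a Bloom filter encoding can be illustrated with a minimal sketch. It is not the variant examined in the abstract: the filter length m, the number of hash functions k, the use of character bigrams, the double-hashing scheme with HMAC-SHA256 and HMAC-SHA512, the Dice coefficient as similarity measure, and the secret key are all assumptions chosen for illustration.

```python
import hashlib
import hmac

KEY = b"hypothetical-shared-secret"  # assumed key, known only to the data owners


def bigrams(name):
    """Split a padded, lower-cased name into its set of 2-character pieces."""
    padded = f" {name.strip().lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def exact_pseudonym(name, key=KEY):
    """Simple pseudonym: one keyed hash of the whole identifier."""
    return hmac.new(key, name.strip().lower().encode(), hashlib.sha256).hexdigest()


def bloom_encode(name, key=KEY, m=1000, k=10):
    """Return the positions set to 1 in an m-bit Bloom filter.

    Every bigram is mapped to k positions by double hashing with two
    keyed hashes. m, k and the hash functions are illustrative choices.
    """
    bits = set()
    for gram in bigrams(name):
        h1 = int.from_bytes(hmac.new(key, gram.encode(), hashlib.sha256).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, gram.encode(), hashlib.sha512).digest(), "big")
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits


def dice(a, b):
    """Dice coefficient of two sets of bit positions (1.0 means identical)."""
    return 2 * len(a & b) / (len(a) + len(b))


# A single typing error changes the simple pseudonym completely ...
print(exact_pseudonym("Schmidt") == exact_pseudonym("Schmitt"))
# ... while the two Bloom filters still share most of their bits.
print(dice(bloom_encode("Schmidt"), bloom_encode("Schmitt")))
print(dice(bloom_encode("Schmidt"), bloom_encode("Mueller")))
```

In this sketch the two spellings of the same name share six of their eight bigrams, so their filters overlap strongly, whereas an unrelated name overlaps only through chance hash collisions. A linkage procedure would then accept pairs whose similarity exceeds a threshold that has to be chosen for the data at hand.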
The practice of survey research often requires the linkage of survey data to large data files such as population registries. If no unique personal identification numbers are available for linkage, personal identifiers such as names and dates of birth have to be used. The use of personal identifiers gives rise to privacy problems, so these identifiers are usually encrypted. If the identifiers contain errors, encryption will cause incomplete linkage between two files. This problem can be solved by the use of recently developed privacy-preserving record linkage techniques. These techniques transform the identifiers into several hundred binary variables. Two cases from two files are considered to match if their sets of binary variables agree more closely than those of any other pair. Statistically, this is a nearest-neighbour problem in high-dimensional space. For large-scale surveys such as censuses, the solution of this problem is computationally demanding. The presentation reviews currently proposed methods and gives a recommendation based on a simulation study.
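The nearest-neighbour formulation can be sketched as a brute-force search. This is an illustration only, not one of the methods reviewed in the presentation. The record identifiers, the 16-bit toy vectors (instead of several hundred binary variables), the Hamming distance as measure of agreement, and the one-directional search from the survey file to the register are all assumptions.

```python
def hamming(a, b):
    """Number of positions in which two bit vectors (stored as ints) differ."""
    return bin(a ^ b).count("1")


def nearest_neighbour_links(file_a, file_b):
    """For each record in file_a, find the closest record in file_b.

    Brute force: every pair is compared, so the number of comparisons
    is len(file_a) * len(file_b). This is why the problem becomes
    computationally demanding for files of census size.
    """
    links = {}
    for id_a, bits_a in file_a.items():
        best_id = min(file_b, key=lambda id_b: hamming(bits_a, file_b[id_b]))
        links[id_a] = (best_id, hamming(bits_a, file_b[best_id]))
    return links


# Hypothetical toy data: encoded identifiers as 16-bit vectors.
survey = {
    "s1": 0b1011001110001101,
    "s2": 0b0000111100001111,
}
register = {
    "r1": 0b1011001110001001,  # differs from s1 in one bit
    "r2": 0b0100110001110010,
    "r3": 0b0000111100001100,  # differs from s2 in two bits
}

for survey_id, (register_id, distance) in nearest_neighbour_links(survey, register).items():
    print(survey_id, "->", register_id, "distance", distance)
```

With a few thousand survey records and millions of register records, this exhaustive comparison requires billions of distance computations, which motivates search methods that avoid comparing every pair.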