Re-identification of home addresses from spatial locations anonymized by Gaussian skew.

Cassa CA, Wieland SC, Mandl KD - Int J Health Geogr (2008)

Bottom Line: We produce multiple anonymized data sets using a single set of addresses and then progressively average the anonymized results related to each address, characterizing the steep decline in distance from the re-identified point to the original location (and the reduction in privacy). With ten anonymized copies of an original data set, we find a substantial decrease in average distance from 0.7 km to 0.2 km between the estimated, re-identified address and the original address. With fifty anonymized copies of an original data set, we find a decrease in average distance from 0.7 km to 0.1 km.


Affiliation: Children's Hospital Informatics Program, Children's Hospital Boston, Boston, MA, USA. cassa@mit.edu

ABSTRACT

Background: Knowledge of the geographical locations of individuals is fundamental to the practice of spatial epidemiology. One approach to preserving the privacy of individual-level addresses in a data set is to de-identify the data using a non-deterministic blurring algorithm that shifts the geocoded values. We investigate a vulnerability in this approach which enables an adversary to re-identify individuals using multiple anonymized versions of the original data set. If several such versions are available, each can be used to incrementally refine estimates of the original geocoded location.

Results: We produce multiple anonymized data sets using a single set of addresses and then progressively average the anonymized results related to each address, characterizing the steep decline in distance from the re-identified point to the original location (and the reduction in privacy). With ten anonymized copies of an original data set, we find a substantial decrease in average distance from 0.7 km to 0.2 km between the estimated, re-identified address and the original address. With fifty anonymized copies of an original data set, we find a decrease in average distance from 0.7 km to 0.1 km.
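The averaging attack described above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the authors' implementation: it treats locations as planar coordinates in kilometres, applies the same fixed Gaussian standard deviation to every point, and the names `skew`, `mean_error_after_averaging`, and `SIGMA_KM` are hypothetical. The value of `SIGMA_KM` is chosen only so that a single anonymized copy is displaced by about 0.7 km on average, matching the starting figure quoted in the results.

```python
import math
import random


def skew(point, sigma_km, rng):
    """Shift an (x, y) point, in km, by independent Gaussian noise on each axis."""
    x, y = point
    return (x + rng.gauss(0.0, sigma_km), y + rng.gauss(0.0, sigma_km))


def mean_error_after_averaging(n_copies, sigma_km, trials=2000, seed=0):
    """Mean distance (km) between the true point and the average of n_copies skewed copies."""
    rng = random.Random(seed)
    true_point = (0.0, 0.0)
    total = 0.0
    for _ in range(trials):
        # Each copy is an independent anonymization of the same original address.
        copies = [skew(true_point, sigma_km, rng) for _ in range(n_copies)]
        est_x = sum(c[0] for c in copies) / n_copies
        est_y = sum(c[1] for c in copies) / n_copies
        total += math.hypot(est_x - true_point[0], est_y - true_point[1])
    return total / trials


# Assumption: per-axis sigma picked so the mean displacement of one copy is ~0.7 km
# (the mean of a Rayleigh distribution is sigma * sqrt(pi / 2)).
SIGMA_KM = 0.7 / math.sqrt(math.pi / 2)

for n in (1, 10, 50):
    print(n, round(mean_error_after_averaging(n, SIGMA_KM), 2))
```

Because the average of n independent Gaussian displacements has standard deviation sigma / sqrt(n), the expected error in this simplified model shrinks roughly as 1 / sqrt(n), which is in line with the reported drop from 0.7 km to about 0.2 km at ten copies and about 0.1 km at fifty.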

Conclusion: We demonstrate that multiple versions of the same data, each anonymized by non-deterministic Gaussian skew, can be used to ascertain original geographic locations. We explore solutions to this problem that include infrastructure to support the safe disclosure of anonymized medical data to prevent inference or re-identification of original address data, and the use of a Markov process-based algorithm to mitigate this risk.


Figure 4: Markov anonymization process to increase data set anonymity. An increase in the anonymity level of a data set, for example from k = 50 to k = 100, can be achieved by increasing the skew level of the k = 50 data set, without knowledge of the authentic data. If increases are made in this way, the risk of re-identification by averaging can be avoided.

Mentions: As shown in Figure 3, one possible anonymization scenario is the sharing of data at a variety of privacy levels with different data consumers. To prevent privacy degradation by averaging when sharing data at multiple levels of k-anonymity [11], a Markov state process can be used to successively generate increasingly anonymized versions of the data set. The Markov property guarantees that several versions anonymized this way cannot be used to infer additional information about a patient's location. One example might be the need to provide versions at two anonymization levels, one at k = 50 and another at k = 100. If the k = 100 version is produced by increasing the skew level of the k = 50 data set, and not by re-anonymizing the original data set, then averaging the two data sets cannot reduce privacy below the k = 50 level. This is illustrated in a Markov process model in Figure 4.
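The difference between chained (Markov) and independent anonymization can be sketched in a short simulation. This is an illustration under assumptions, not the paper's algorithm: coordinates are planar and in kilometres, the two privacy levels are represented simply by a smaller and a larger Gaussian standard deviation (the mapping from k to a skew level is not modelled), and all function and parameter names are hypothetical.

```python
import math
import random


def gaussian_shift(point, sigma_km, rng):
    """Shift an (x, y) point, in km, by independent Gaussian noise on each axis."""
    x, y = point
    return (x + rng.gauss(0.0, sigma_km), y + rng.gauss(0.0, sigma_km))


def midpoint(p, q):
    """Simple average of two released points, as an adversary might compute."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def compare_release_strategies(sigma_low, sigma_extra, trials=5000, seed=1):
    """Mean re-identification error (km) for three adversary strategies."""
    rng = random.Random(seed)
    origin = (0.0, 0.0)  # the true address
    # Total noise of the higher-privacy release, identical for both strategies.
    sigma_high = math.hypot(sigma_low, sigma_extra)
    err = {"low_only": 0.0, "chained_avg": 0.0, "independent_avg": 0.0}
    for _ in range(trials):
        low = gaussian_shift(origin, sigma_low, rng)
        # Markov release: extra noise is added to the lower-privacy release.
        chained_high = gaussian_shift(low, sigma_extra, rng)
        # Independent release: the original address is re-anonymized from scratch.
        independent_high = gaussian_shift(origin, sigma_high, rng)
        err["low_only"] += math.hypot(*low)
        err["chained_avg"] += math.hypot(*midpoint(low, chained_high))
        err["independent_avg"] += math.hypot(*midpoint(low, independent_high))
    return {name: total / trials for name, total in err.items()}


result = compare_release_strategies(sigma_low=0.5, sigma_extra=0.5)
for name, value in result.items():
    print(f"{name}: {value:.3f} km")
```

In this simplified model, averaging the lower-privacy release with an independently anonymized higher-privacy release brings the estimate closer to the true address than the lower-privacy release alone. Averaging it with the chained release only mixes in the extra noise, so the adversary does no better than by using the lower-privacy release on its own.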

