Limits...
Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH

Related in: MedlinePlus

Cryptosets exploit a statistical property of hash functions: they distribute private IDs uniformly across public IDs, assigning several private IDs to the same public ID when truncated.Names and birth dates are distributed with discernible patterns in 2010 census data (Panel A). For example, surnames follow a power law distribution, with a few very common names and a long tail of uncommon names. Birth years fall in a comparatively narrow range. Birth days follow a seasonal and weekly cycle. Nonetheless, hashing 250,000 private IDs—of names with dates of birth—generated from these distributions yields uniformly distributed public IDs (Panel B). Moreover, people binned to the same public ID have no real-world connection to one another. Histograms of these public IDs, cryptosets, can be used to estimate the overlap between private datasets (Panel C). These estimates are computed by multiplying the correlation between cryptosets by the geometric mean of their size. P-values, confidence intervals, and other statistical measures of these estimates can be computed. In the figure, an unrealistically small length of 20 is used for clarity.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g002: Cryptosets exploit a statistical property of hash functions: they distribute private IDs uniformly across public IDs, assigning several private IDs to the same public ID when truncated.Names and birth dates are distributed with discernible patterns in 2010 census data (Panel A). For example, surnames follow a power law distribution, with a few very common names and a long tail of uncommon names. Birth years fall in a comparatively narrow range. Birth days follow a seasonal and weekly cycle. Nonetheless, hashing 250,000 private IDs—of names with dates of birth—generated from these distributions yields uniformly distributed public IDs (Panel B). Moreover, people binned to the same public ID have no real-world connection to one another. Histograms of these public IDs, cryptosets, can be used to estimate the overlap between private datasets (Panel C). These estimates are computed by multiplying the correlation between cryptosets by the geometric mean of their size. P-values, confidence intervals, and other statistical measures of these estimates can be computed. In the figure, an unrealistically small length of 20 is used for clarity.

Mentions: Patient identifiers. We simulated a population of 250,000 patients to use in simulation studies. A name and a birthday were drawn a random according to the distribution of names and birthdays in US census data (Fig. 2A and 2B). The private ID is defined as the lowercase last name concatenated with the numerical birthdate in mmddyyyy format. Public IDs are defined as the SHA256 hash of the private IDs (concatenated with a salt string) modulo the length of the cryptoset, which varies by experiment. The salt string was used like a random seed by changing it between trials. In practice, the salt string would be fixed and publicly specified.


Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Cryptosets exploit a statistical property of hash functions: they distribute private IDs uniformly across public IDs, assigning several private IDs to the same public ID when truncated.Names and birth dates are distributed with discernible patterns in 2010 census data (Panel A). For example, surnames follow a power law distribution, with a few very common names and a long tail of uncommon names. Birth years fall in a comparatively narrow range. Birth days follow a seasonal and weekly cycle. Nonetheless, hashing 250,000 private IDs—of names with dates of birth—generated from these distributions yields uniformly distributed public IDs (Panel B). Moreover, people binned to the same public ID have no real-world connection to one another. Histograms of these public IDs, cryptosets, can be used to estimate the overlap between private datasets (Panel C). These estimates are computed by multiplying the correlation between cryptosets by the geometric mean of their size. P-values, confidence intervals, and other statistical measures of these estimates can be computed. In the figure, an unrealistically small length of 20 is used for clarity.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g002: Cryptosets exploit a statistical property of hash functions: they distribute private IDs uniformly across public IDs, assigning several private IDs to the same public ID when truncated.Names and birth dates are distributed with discernible patterns in 2010 census data (Panel A). For example, surnames follow a power law distribution, with a few very common names and a long tail of uncommon names. Birth years fall in a comparatively narrow range. Birth days follow a seasonal and weekly cycle. Nonetheless, hashing 250,000 private IDs—of names with dates of birth—generated from these distributions yields uniformly distributed public IDs (Panel B). Moreover, people binned to the same public ID have no real-world connection to one another. Histograms of these public IDs, cryptosets, can be used to estimate the overlap between private datasets (Panel C). These estimates are computed by multiplying the correlation between cryptosets by the geometric mean of their size. P-values, confidence intervals, and other statistical measures of these estimates can be computed. In the figure, an unrealistically small length of 20 is used for clarity.
Mentions: Patient identifiers. We simulated a population of 250,000 patients to use in simulation studies. A name and a birthday were drawn a random according to the distribution of names and birthdays in US census data (Fig. 2A and 2B). The private ID is defined as the lowercase last name concatenated with the numerical birthdate in mmddyyyy format. Public IDs are defined as the SHA256 hash of the private IDs (concatenated with a salt string) modulo the length of the cryptoset, which varies by experiment. The salt string was used like a random seed by changing it between trials. In practice, the salt string would be fixed and publicly specified.

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH
Related in: MedlinePlus