Limits...
Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH

Related in: MedlinePlus

Information gain of cryptosets (y-axis) as a function of number of public IDs (the length) and the number of elements in the cryptoset (A on the x-axis).Our estimates of the information gain are very close to simulations. The curve almost perfectly fits the data with 5 summation terms. A key finding of this study that is evident in this graph is that the information content in cryptosets rapidly approaches zero as the number of items is increased. This makes sense, because as more items are added the distribution of public IDs approaches a uniform, non-informative distribution.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g005: Information gain of cryptosets (y-axis) as a function of number of public IDs (the length) and the number of elements in the cryptoset (A on the x-axis).Our estimates of the information gain are very close to simulations. The curve almost perfectly fits the data with 5 summation terms. A key finding of this study that is evident in this graph is that the information content in cryptosets rapidly approaches zero as the number of items is increased. This makes sense, because as more items are added the distribution of public IDs approaches a uniform, non-informative distribution.

Mentions: Here, we present an analysis of the security of cryptosets, using information gain as a measure of security risk. Datasets are generated by random sampling, the corresponding cryptosets are constructed for a large range of sample sizes and lengths, and the information gain of these cryptosets is computed (Fig. 5). The key finding from this analysis is that cryptosets become more secure as more items are added. Since the cryptographic hash distributes the private IDs evenly across public IDs (Fig. 2 and 3), the more items in the cryptoset, the closer it is to a uniform distribution and the less informative it is about specific public IDs. As the number of items increases, cryptoset security rapidly approaches “information-theoretic” security, the strongest type of security possible in cryptography, which cannot be broken even with infinite computing power [40–42]. This property makes cryptosets very resistant to attack, even under a malicious model.


Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Information gain of cryptosets (y-axis) as a function of number of public IDs (the length) and the number of elements in the cryptoset (A on the x-axis).Our estimates of the information gain are very close to simulations. The curve almost perfectly fits the data with 5 summation terms. A key finding of this study that is evident in this graph is that the information content in cryptosets rapidly approaches zero as the number of items is increased. This makes sense, because as more items are added the distribution of public IDs approaches a uniform, non-informative distribution.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g005: Information gain of cryptosets (y-axis) as a function of number of public IDs (the length) and the number of elements in the cryptoset (A on the x-axis).Our estimates of the information gain are very close to simulations. The curve almost perfectly fits the data with 5 summation terms. A key finding of this study that is evident in this graph is that the information content in cryptosets rapidly approaches zero as the number of items is increased. This makes sense, because as more items are added the distribution of public IDs approaches a uniform, non-informative distribution.
Mentions: Here, we present an analysis of the security of cryptosets, using information gain as a measure of security risk. Datasets are generated by random sampling, the corresponding cryptosets are constructed for a large range of sample sizes and lengths, and the information gain of these cryptosets is computed (Fig. 5). The key finding from this analysis is that cryptosets become more secure as more items are added. Since the cryptographic hash distributes the private IDs evenly across public IDs (Fig. 2 and 3), the more items in the cryptoset, the closer it is to a uniform distribution and the less informative it is about specific public IDs. As the number of items increases, cryptoset security rapidly approaches “information-theoretic” security, the strongest type of security possible in cryptography, which cannot be broken even with infinite computing power [40–42]. This property makes cryptosets very resistant to attack, even under a malicious model.

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH
Related in: MedlinePlus