Limits...
Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH
Cryptosets can measure the overlap between chemical collections.In this figure we compare two molecular libraries, which have about 5000 scaffolds in common. The public IDs from the libraries’ scaffolds are nearly evenly distributed across public IDs, but a subtle, statistically significant correlation demonstrates they overlap. The estimated overlap is quite good. Moreover, the privacy of the libraries is maintained. Within each public ID bin (representative examples shown for one bin), there are both scaffolds unique and common to each library, and there is no way to determine which are which from the cryptosets. Sharing overlaps between molecular libraries could help researchers know when it makes sense to screen a private molecule library with a biological assay.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g003: Cryptosets can measure the overlap between chemical collections.In this figure we compare two molecular libraries, which have about 5000 scaffolds in common. The public IDs from the libraries’ scaffolds are nearly evenly distributed across public IDs, but a subtle, statistically significant correlation demonstrates they overlap. The estimated overlap is quite good. Moreover, the privacy of the libraries is maintained. Within each public ID bin (representative examples shown for one bin), there are both scaffolds unique and common to each library, and there is no way to determine which are which from the cryptosets. Sharing overlaps between molecular libraries could help researchers know when it makes sense to screen a private molecule library with a biological assay.

Mentions: For the second example, we computed the overlap between two chemical libraries (Fig. 3). In this case, there were tens of thousands of scaffolds summarized by cryptosets of just 100 public IDs. Both cryptosets were entirely saturated, looking almost like flat, uniform histograms. Notwithstanding our theoretical results, it intuitively seems like computing overlap would be impossible in this case. Nonetheless, we find the overlap estimates are quite close to the true overlap. This agrees with the theoretical analysis, and demonstrates how our intuition can be misleading on this problem.


Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Cryptosets can measure the overlap between chemical collections.In this figure we compare two molecular libraries, which have about 5000 scaffolds in common. The public IDs from the libraries’ scaffolds are nearly evenly distributed across public IDs, but a subtle, statistically significant correlation demonstrates they overlap. The estimated overlap is quite good. Moreover, the privacy of the libraries is maintained. Within each public ID bin (representative examples shown for one bin), there are both scaffolds unique and common to each library, and there is no way to determine which are which from the cryptosets. Sharing overlaps between molecular libraries could help researchers know when it makes sense to screen a private molecule library with a biological assay.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g003: Cryptosets can measure the overlap between chemical collections.In this figure we compare two molecular libraries, which have about 5000 scaffolds in common. The public IDs from the libraries’ scaffolds are nearly evenly distributed across public IDs, but a subtle, statistically significant correlation demonstrates they overlap. The estimated overlap is quite good. Moreover, the privacy of the libraries is maintained. Within each public ID bin (representative examples shown for one bin), there are both scaffolds unique and common to each library, and there is no way to determine which are which from the cryptosets. Sharing overlaps between molecular libraries could help researchers know when it makes sense to screen a private molecule library with a biological assay.
Mentions: For the second example, we computed the overlap between two chemical libraries (Fig. 3). In this case, there were tens of thousands of scaffolds summarized by cryptosets of just 100 public IDs. Both cryptosets were entirely saturated, looking almost like flat, uniform histograms. Notwithstanding our theoretical results, it intuitively seems like computing overlap would be impossible in this case. Nonetheless, we find the overlap estimates are quite close to the true overlap. This agrees with the theoretical analysis, and demonstrates how our intuition can be misleading on this problem.

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH