Limits...
Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH
Cryptosets provide a better tradeoff between estimate error and security risk.The figures plot the average over 50 simulations of the overlap estimate error (on a log scale) against average information gain (in bits) for cryptosets and Bloom filters over a large range of lengths (ranging from 500 to 10000). The ideal method would have both error and information gain close to zero. The left column of figures uses datasets with 100 items, while the right column of uses datasets with 1000 items. The top row simulates these datasets with an overlap of 40%, while the bottom row simulates with an overlap of 80%. Cryptosets are consistently more accurate and secure than Bloom filters, and this trend is most pronounced in large datasets with lower overlaps (top right).
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g007: Cryptosets provide a better tradeoff between estimate error and security risk.The figures plot the average over 50 simulations of the overlap estimate error (on a log scale) against average information gain (in bits) for cryptosets and Bloom filters over a large range of lengths (ranging from 500 to 10000). The ideal method would have both error and information gain close to zero. The left column of figures uses datasets with 100 items, while the right column of uses datasets with 1000 items. The top row simulates these datasets with an overlap of 40%, while the bottom row simulates with an overlap of 80%. Cryptosets are consistently more accurate and secure than Bloom filters, and this trend is most pronounced in large datasets with lower overlaps (top right).

Mentions: There is a trade of between estimate error and security risk. As the number of public IDs increase (the length), estimates become more accurate but the data structure shares more information, becoming less secure. Plotting the trade off between error and information gain (Fig. 7) is another way of comparing Bloom filters and cryptosets. We find that for any given accuracy, cryptosets contain less information than Bloom filters. Moreover, increasing the number of hash functions increases the estimate and the information risk. These results indicate that cryptosets use less information more efficiently than do Bloom filters to make their overlap estimates.


Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Cryptosets provide a better tradeoff between estimate error and security risk.The figures plot the average over 50 simulations of the overlap estimate error (on a log scale) against average information gain (in bits) for cryptosets and Bloom filters over a large range of lengths (ranging from 500 to 10000). The ideal method would have both error and information gain close to zero. The left column of figures uses datasets with 100 items, while the right column of uses datasets with 1000 items. The top row simulates these datasets with an overlap of 40%, while the bottom row simulates with an overlap of 80%. Cryptosets are consistently more accurate and secure than Bloom filters, and this trend is most pronounced in large datasets with lower overlaps (top right).
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g007: Cryptosets provide a better tradeoff between estimate error and security risk.The figures plot the average over 50 simulations of the overlap estimate error (on a log scale) against average information gain (in bits) for cryptosets and Bloom filters over a large range of lengths (ranging from 500 to 10000). The ideal method would have both error and information gain close to zero. The left column of figures uses datasets with 100 items, while the right column of uses datasets with 1000 items. The top row simulates these datasets with an overlap of 40%, while the bottom row simulates with an overlap of 80%. Cryptosets are consistently more accurate and secure than Bloom filters, and this trend is most pronounced in large datasets with lower overlaps (top right).
Mentions: There is a trade of between estimate error and security risk. As the number of public IDs increase (the length), estimates become more accurate but the data structure shares more information, becoming less secure. Plotting the trade off between error and information gain (Fig. 7) is another way of comparing Bloom filters and cryptosets. We find that for any given accuracy, cryptosets contain less information than Bloom filters. Moreover, increasing the number of hash functions increases the estimate and the information risk. These results indicate that cryptosets use less information more efficiently than do Bloom filters to make their overlap estimates.

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH