Limits...
Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH

Related in: MedlinePlus

Bloom filter overlap estimates are unstable, with sharp dependence on dataset size.Bloom filters with one (top row) or two (middle row) hash functions can estimate the overlap between two datasets. However, as the dataset size grows along the x-axis, the estimate error gradually grows and then the estimates suddenly become uncomputable. Once the dataset size becomes too large, the chance of fully saturating the filters sharply increases, causing a sudden, catastrophic increase in error. At saturation, Bloom filters cannot estimate overlap with any accuracy and are plotted as ‘NaN’ values in the graphs. In contrast, cryptoset accuracy (bottom row) is rock-solid stable, and not dependent on dataset size.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g006: Bloom filter overlap estimates are unstable, with sharp dependence on dataset size.Bloom filters with one (top row) or two (middle row) hash functions can estimate the overlap between two datasets. However, as the dataset size grows along the x-axis, the estimate error gradually grows and then the estimates suddenly become uncomputable. Once the dataset size becomes too large, the chance of fully saturating the filters sharply increases, causing a sudden, catastrophic increase in error. At saturation, Bloom filters cannot estimate overlap with any accuracy and are plotted as ‘NaN’ values in the graphs. In contrast, cryptoset accuracy (bottom row) is rock-solid stable, and not dependent on dataset size.

Mentions: Overlaps between private datasets can be estimated using Bloom filters, but the accuracy of these estimates are unstable with a strong dependence on dataset size (Fig. 6). As dataset size increases, Bloom filter estimates gradually become less accurate, and then suddenly become uncomputable when the datasets are large enough to saturate the filters. The exact dataset size of this transition from computable to uncomputable estimates also depends on the degree of overlap between datasets. Moreover, this transition occurs at smaller dataset sizes when the number of hash functions is increased. In contrast, cryptosets retain the exact same accuracy for all dataset sizes and are never uncomputable.


Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Bloom filter overlap estimates are unstable, with sharp dependence on dataset size.Bloom filters with one (top row) or two (middle row) hash functions can estimate the overlap between two datasets. However, as the dataset size grows along the x-axis, the estimate error gradually grows and then the estimates suddenly become uncomputable. Once the dataset size becomes too large, the chance of fully saturating the filters sharply increases, causing a sudden, catastrophic increase in error. At saturation, Bloom filters cannot estimate overlap with any accuracy and are plotted as ‘NaN’ values in the graphs. In contrast, cryptoset accuracy (bottom row) is rock-solid stable, and not dependent on dataset size.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4340911&req=5

pone.0117898.g006: Bloom filter overlap estimates are unstable, with sharp dependence on dataset size.Bloom filters with one (top row) or two (middle row) hash functions can estimate the overlap between two datasets. However, as the dataset size grows along the x-axis, the estimate error gradually grows and then the estimates suddenly become uncomputable. Once the dataset size becomes too large, the chance of fully saturating the filters sharply increases, causing a sudden, catastrophic increase in error. At saturation, Bloom filters cannot estimate overlap with any accuracy and are plotted as ‘NaN’ values in the graphs. In contrast, cryptoset accuracy (bottom row) is rock-solid stable, and not dependent on dataset size.
Mentions: Overlaps between private datasets can be estimated using Bloom filters, but the accuracy of these estimates are unstable with a strong dependence on dataset size (Fig. 6). As dataset size increases, Bloom filter estimates gradually become less accurate, and then suddenly become uncomputable when the datasets are large enough to saturate the filters. The exact dataset size of this transition from computable to uncomputable estimates also depends on the degree of overlap between datasets. Moreover, this transition occurs at smaller dataset sizes when the number of hash functions is increased. In contrast, cryptosets retain the exact same accuracy for all dataset sizes and are never uncomputable.

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets.Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power.We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

View Article: PubMed Central - PubMed

Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

Show MeSH
Related in: MedlinePlus