Securely measuring the overlap between private datasets with cryptosets.

Swamidass SJ, Matlock M, Rozenblit L - PLoS ONE (2015)

Bottom Line: This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.


Affiliation: Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.

ABSTRACT
Many scientific questions are best approached by combining data, collected by different groups or across large collaborative networks, into a single analysis. Unfortunately, some of the most interesting and powerful datasets, such as health records, genetic data, and drug discovery data, cannot be freely shared because they contain sensitive information. In many situations, knowing whether private datasets overlap determines whether it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which cannot be broken even with unlimited computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, stable in accuracy, and secure.
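
To make the idea of a "publicly shareable summary" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cryptoset can be modeled as a fixed-length histogram in which each private identifier is mapped by a hash function to one of `length` public IDs. The choice of SHA-256, the function names `public_id` and `build_cryptoset`, and the example identifiers are all illustrative assumptions; the abstract does not specify the hash, any salt or shared secret, or how the length is chosen.

```python
import hashlib


def public_id(private_id: str, length: int) -> int:
    """Map a private identifier to a public ID in [0, length).

    Assumption: SHA-256 reduced modulo `length` stands in for whatever
    hash the real method uses.
    """
    digest = hashlib.sha256(private_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % length


def build_cryptoset(private_ids, length: int) -> list:
    """Return a histogram counting how many distinct items fall on each public ID."""
    counts = [0] * length
    for pid in set(private_ids):  # count each distinct item once
        counts[public_id(pid, length)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    site_a = [f"patient-{i}" for i in range(200)]
    print(build_cryptoset(site_a, length=10))
```

Because many private identifiers collide on each public ID when the length is small relative to the dataset, the histogram alone does not reveal which individual items are present; a shorter length hides more and a longer length is more informative about overlap.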

pone.0117898.g004: Cryptosets stably estimate the overlap proportion between private datasets, no matter the dataset size, and with accuracy tunable by length. Each column of figures corresponds to a different number of public IDs: 500, 1000 and 2000. The first row shows the results of an empirical study, demonstrating that the error (the spread of each data series) is stable across all dataset sizes. The second row shows the analytically derived 95% confidence intervals (Equation 9), which closely match the distribution of empirical estimates and are stable across all dataset sizes. Also evident in these figures is that estimate accuracy is tuned by the length (the number of possible public IDs) of the cryptosets.

Mentions: Finally, we systematically studied the overlap estimates across a large range of scenarios. From the patient identifier data, we simulated pairs of private datasets with overlap proportions of 10%, 30%, 50%, and 80%, ranging in size from 10 to 10,000. The true overlap between each pair was compared to the overlap estimated using cryptosets of length 500, 1000, and 2000 (Fig. 4). We see in this study that (1) the overlap estimate is unbiased, (2) the absolute error of the estimate is proportional to the dataset's size, meaning the relative error is constant, and (3) the 95% confidence intervals closely match the observed results.
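
A study of this kind can be imitated with a small simulation. The sketch below is an illustration under stated assumptions and is not the paper's estimator or its Equation 9. It assumes that public IDs behave like uniform random draws, that both datasets in a pair have the same size, and that "overlap proportion" means the shared fraction of that size. The estimator is a simple moment-based one: for count vectors A and B of length L, the expected dot product is n + (N_A N_B - n)/L, where n is the number of shared items, so solving for n gives an unbiased estimate. The names `simulate_pair`, `estimate_overlap`, and `mean_estimate` are hypothetical.

```python
import random


def simulate_pair(size, overlap_fraction, length, rng):
    """Simulate cryptosets for two same-sized datasets sharing a fraction of items."""
    shared = round(size * overlap_fraction)
    a = [0] * length
    b = [0] * length
    # A shared item maps to the same public ID in both cryptosets.
    for _ in range(shared):
        i = rng.randrange(length)
        a[i] += 1
        b[i] += 1
    # Unshared items map to independent public IDs.
    for _ in range(size - shared):
        a[rng.randrange(length)] += 1
        b[rng.randrange(length)] += 1
    return a, b, shared


def estimate_overlap(a, b):
    """Moment-based overlap estimate (an assumption, not the paper's formula)."""
    length = len(a)
    dot = sum(x * y for x, y in zip(a, b))
    chance = sum(a) * sum(b) / length  # collisions expected by chance alone
    return (dot - chance) / (1 - 1 / length)


def mean_estimate(size, overlap_fraction, length, trials, seed=0):
    """Average the estimate over repeated simulated pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a, b, _ = simulate_pair(size, overlap_fraction, length, rng)
        total += estimate_overlap(a, b)
    return total / trials


if __name__ == "__main__":
    for size in (100, 1000, 10000):
        est = mean_estimate(size, 0.5, length=1000, trials=50)
        print(size, round(est / size, 3))  # estimated overlap as a fraction of size
```

Under these assumptions the standard deviation of a single estimate grows roughly in proportion to the dataset size and shrinks with the square root of the cryptoset length, which is consistent with the constant relative error and length-tunable accuracy reported above.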

