True Randomness from Big Data


ABSTRACT

Generating random bits is a difficult task that is important for physical-systems simulation, cryptography, and many other applications relying on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as in astronomy and genomics, as well as data produced by individuals, such as internet search logs, sensor-network readings, and social-network feeds. We view the generation of such data as sampling from a big source, a random variable whose samples are at least a few gigabytes in size. This view initiates the study of big sources in the randomness-extraction literature. Previous approaches to big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources, and we validate it extensively on real data sets. The experiments indicate that our method is efficient enough to handle big sources in practice, whereas previous extractor constructions are not. In terms of quality, our method at least matches quantum randomness expanders and classical-world empirical extractors, as measured by standardized statistical tests.




Figure 3: Running time of RRB compared with other extractors. Running time is measured on input samples of 1 GB–20 GB for the von Neumann extractor, the Local Hash extractor (with block size 1024 bits), and RRB (with k = n/4, n/8, n/16 and ε = 10^-10, 10^-20, 10^-30). Trevisan's extractor is measured only for ε = 0.001 on samples of up to 5 MB = 4 × 10^7 bits, since the available finite-field implementations cannot handle larger samples or smaller ε. The running time of Trevisan's extractor on larger inputs (in particular, 103,407 years for a 20 GB input) is estimated by polynomial fitting under the assumption that all data resides in main memory, which gives it an unrealistic advantage. The exact form of the fitted polynomial is determined through cross-validation and standard analysis of polynomial norms.
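
The extrapolation step described in this caption (a polynomial fit whose form is chosen by cross-validation) is not spelled out in the excerpt. The following is a minimal Python sketch, under stated assumptions, of how such a fit-and-extrapolate estimate could be produced; the (size, time) measurements, the candidate degrees, and the unit scaling are hypothetical placeholders, not the authors' data or code.

# Minimal sketch (not the authors' code) of estimating an extractor's running
# time on large inputs by polynomial fitting with a cross-validated degree.
# The (size, seconds) measurements below are hypothetical placeholders.
import numpy as np

sizes_mbit = np.array([1.0, 5.0, 10.0, 20.0, 40.0])    # input size in Mbit (hypothetical)
seconds = np.array([0.8, 12.0, 55.0, 240.0, 1100.0])   # measured run time (hypothetical)

def cv_error(x, y, degree, folds=5):
    """k-fold cross-validation error of a polynomial fit of the given degree."""
    idx = np.arange(len(x))
    err = 0.0
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        coeffs = np.polyfit(x[train], y[train], degree)
        err += np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    return err / folds

# Choose the polynomial degree with the smallest cross-validation error,
# then refit on all measurements.
best_degree = min(range(1, 4), key=lambda d: cv_error(sizes_mbit, seconds, d))
coeffs = np.polyfit(sizes_mbit, seconds, best_degree)

# Extrapolate to a 20 GB input (converted to Mbit); as the caption notes,
# assuming all data sits in main memory is an unrealistically favorable setting.
twenty_gb_mbit = 20 * 8e9 / 1e6
estimate_sec = np.polyval(coeffs, twenty_gb_mbit)
print(estimate_sec / (3600 * 24 * 365), "years (rough extrapolation)")

The cross-validation simply scores each candidate degree by how well a fit on part of the measurements predicts the held-out measurements, which guards against choosing a degree that overfits the few timing points available.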

Mentions: A key feature of the RRB extractor in Fig. 2 is its simplicity; the technical difficulty lies in proving its correctness, which requires a novel, non-trivial analysis. RRB is the first extractor capable of handling big sources without additional assumptions. Previous works require either (i) unrealistic running times or (ii) ad hoc assumptions about the source. In particular, local extractors such as von Neumann (ref. 22) and Local Hash fail significantly in terms of output quality, whereas the throughput of Trevisan's extractor (ref. 19) and its follow-ups degrades significantly with the sample size (ref. 12; see Fig. 3), even when practical optimizations are considered; e.g., 103,407 years of computing time for a 20 GB input sample with ε = 10^-3 and κ = 1/2. We compare against the Local Hash and von Neumann extractors because they are the only extractors experimented with in previous work (see ref. 23 for empirical work using von Neumann's extractor, and refs 12, 24 and 25 for empirical work using Local Hash) and, importantly, both happen to be streaming extractors. Given this special attention in previous work, they are two ideal candidates for comparison. We refer the reader to Table 2, Fig. 3, and the Supplementary Information for details.
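
Neither baseline implementation appears in this excerpt. The sketch below illustrates, under stated assumptions, what the two streaming baselines discussed above can look like in Python: the von Neumann extractor consumes the input as bit pairs, and a Local-Hash-style extractor applies a seeded hash to fixed 1024-bit blocks. The use of BLAKE2b as the block hash, and the specific block and output sizes, are assumptions for illustration only; the Local Hash construction in refs 12, 24 and 25 may use a different hash family.

# Illustrative sketches (not the paper's implementations) of the two
# streaming baselines discussed above.
import hashlib

def von_neumann(bits):
    """Von Neumann extractor: pair 01 -> 0, pair 10 -> 1; discard 00 and 11.
    Produces unbiased bits only when input bits are independent with a fixed bias."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)
    return out

def local_hash(data, seed, block_bytes=128, out_bytes=16):
    """Local-Hash-style sketch: hash each 1024-bit (128-byte) block with a
    seeded hash and concatenate the digests. BLAKE2b stands in for the actual
    hash family (an assumption, not the paper's construction)."""
    out = bytearray()
    for i in range(0, len(data), block_bytes):
        out += hashlib.blake2b(data[i:i + block_bytes], key=seed,
                               digest_size=out_bytes).digest()
    return bytes(out)

# Usage on a small hypothetical sample; real inputs would be streamed from disk.
sample = bytes(range(256)) * 4
bits = [(byte >> j) & 1 for byte in sample for j in range(8)]
print(len(von_neumann(bits)), "output bits from von Neumann")
print(len(local_hash(sample, seed=b"16-byte-seed-abc")), "output bytes from Local Hash")

Both routines make a single pass over the input, which is what makes them streaming extractors and hence natural candidates for gigabyte-scale samples; the output-quality caveats noted above still apply.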

