Limits...
Sequence Factorization with Multiple References.

Wandelt S, Leser U - PLoS ONE (2015)

Bottom Line: Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB).Based on our evaluation, we identify the best configurations for common use cases.Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

View Article: PubMed Central - PubMed

Affiliation: Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany.

ABSTRACT
The success of high-throughput sequencing has lead to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between input sequence and a reference sequence, gained lots of interest in this field. Highly-similar sequences, e.g., Human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that the compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for the referential compression against multiple references: The factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

No MeSH data available.


Related factorization algorithms.We show results for related factorization algorithms: main memory usage, factorization speed, and mean number of factors per sequence for dataset HG21. For ease of comparison, we added few interesting instances of our own factorization algorithms (marked with a dashed line). Note that the mean number of factors cannot be compared directly between related work and our algorithms, since our RMEs contain mismatch characters, while related work factorizations are solely pointer into references (which yields more factors, albeit being optimal).
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4589410&req=5

pone.0139000.g008: Related factorization algorithms.We show results for related factorization algorithms: main memory usage, factorization speed, and mean number of factors per sequence for dataset HG21. For ease of comparison, we added few interesting instances of our own factorization algorithms (marked with a dashed line). Note that the mean number of factors cannot be compared directly between related work and our algorithms, since our RMEs contain mismatch characters, while related work factorizations are solely pointer into references (which yields more factors, albeit being optimal).

Mentions: We evaluated experimentally the following other factorization algorithms: ISA6r [23], KKP3 [32], and KKP1s [32], whose implementations are available online at https://www.cs.helsinki.fi/group/pads/lz77.html. According to that webpage, ISA6r is specialized for highly repetitive inputs, KKP3 is the fastest for non-highly repetitive inputs, and KKP1s streams a suffix array from the disk. The experimental results are reported in Fig 8. All factorization algorithms need at least n*∣LR∣ bytes of storage in memory (usually more), where LR is the length of the shortest reference and n is the number of references. Moreover, the factorization speed is often much slower than our own baseline implementation with enhanced suffix arrays. The size of factorizations decreases tremendously with the number of references, similarly to our own experiments.


Sequence Factorization with Multiple References.

Wandelt S, Leser U - PLoS ONE (2015)

Related factorization algorithms.We show results for related factorization algorithms: main memory usage, factorization speed, and mean number of factors per sequence for dataset HG21. For ease of comparison, we added few interesting instances of our own factorization algorithms (marked with a dashed line). Note that the mean number of factors cannot be compared directly between related work and our algorithms, since our RMEs contain mismatch characters, while related work factorizations are solely pointer into references (which yields more factors, albeit being optimal).
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4589410&req=5

pone.0139000.g008: Related factorization algorithms.We show results for related factorization algorithms: main memory usage, factorization speed, and mean number of factors per sequence for dataset HG21. For ease of comparison, we added few interesting instances of our own factorization algorithms (marked with a dashed line). Note that the mean number of factors cannot be compared directly between related work and our algorithms, since our RMEs contain mismatch characters, while related work factorizations are solely pointer into references (which yields more factors, albeit being optimal).
Mentions: We evaluated experimentally the following other factorization algorithms: ISA6r [23], KKP3 [32], and KKP1s [32], whose implementations are available online at https://www.cs.helsinki.fi/group/pads/lz77.html. According to that webpage, ISA6r is specialized for highly repetitive inputs, KKP3 is the fastest for non-highly repetitive inputs, and KKP1s streams a suffix array from the disk. The experimental results are reported in Fig 8. All factorization algorithms need at least n*∣LR∣ bytes of storage in memory (usually more), where LR is the length of the shortest reference and n is the number of references. Moreover, the factorization speed is often much slower than our own baseline implementation with enhanced suffix arrays. The size of factorizations decreases tremendously with the number of references, similarly to our own experiments.

Bottom Line: Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB).Based on our evaluation, we identify the best configurations for common use cases.Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

View Article: PubMed Central - PubMed

Affiliation: Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany.

ABSTRACT
The success of high-throughput sequencing has lead to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between input sequence and a reference sequence, gained lots of interest in this field. Highly-similar sequences, e.g., Human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that the compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for the referential compression against multiple references: The factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

No MeSH data available.