Limits...
Sequence Factorization with Multiple References.

Wandelt S, Leser U - PLoS ONE (2015)

Bottom Line: Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB).Based on our evaluation, we identify the best configurations for common use cases.Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

View Article: PubMed Central - PubMed

Affiliation: Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany.

ABSTRACT
The success of high-throughput sequencing has lead to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between input sequence and a reference sequence, gained lots of interest in this field. Highly-similar sequences, e.g., Human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that the compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for the referential compression against multiple references: The factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

No MeSH data available.


Effectiveness of SUCC/POSI on factorization time for an increasing number of sequences in the RME graph.A number of 10 sequences allows for fast factorization, while keeping the memory overhead low.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4589410&req=5

pone.0139000.g012: Effectiveness of SUCC/POSI on factorization time for an increasing number of sequences in the RME graph.A number of 10 sequences allows for fast factorization, while keeping the memory overhead low.

Mentions: Next we analyzed the effect of RME graphs on factorization speed. Our analysis showed that it is not beneficial to store the RME graph of all previously factorized sequences for two reasons. First, the memory usage is linearly increasing with each sequence. Second, the time spent on checking RME graph predictions increases with the number of sequences in the RME graph. We measured the time spend on SUCC/POSI prediction and index lookups separately for an increasing number of factorized sequences, to find out how many sequences are actually necessary. The result is shown in Fig 12. For all datasets, RME prediction based on the RME graph shows a significant reduction of factorization time for the first few sequences. More than 10 sequences in the RME graph do not improve factorization times significantly. In fact, we found that the required number of index lookups is roughly constant with more than 10 sequences in the RME graph. Overall, POSI reduces the factorization time further for HG21 and all other human genomes (data not shown), but for AT and yeast, there is no significant speedup from SUCC to POSI. It seems that for these not-so-similar datasets, prediction based on the previous RME (SUCC) is already sufficient.


Sequence Factorization with Multiple References.

Wandelt S, Leser U - PLoS ONE (2015)

Effectiveness of SUCC/POSI on factorization time for an increasing number of sequences in the RME graph.A number of 10 sequences allows for fast factorization, while keeping the memory overhead low.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4589410&req=5

pone.0139000.g012: Effectiveness of SUCC/POSI on factorization time for an increasing number of sequences in the RME graph.A number of 10 sequences allows for fast factorization, while keeping the memory overhead low.
Mentions: Next we analyzed the effect of RME graphs on factorization speed. Our analysis showed that it is not beneficial to store the RME graph of all previously factorized sequences for two reasons. First, the memory usage is linearly increasing with each sequence. Second, the time spent on checking RME graph predictions increases with the number of sequences in the RME graph. We measured the time spend on SUCC/POSI prediction and index lookups separately for an increasing number of factorized sequences, to find out how many sequences are actually necessary. The result is shown in Fig 12. For all datasets, RME prediction based on the RME graph shows a significant reduction of factorization time for the first few sequences. More than 10 sequences in the RME graph do not improve factorization times significantly. In fact, we found that the required number of index lookups is roughly constant with more than 10 sequences in the RME graph. Overall, POSI reduces the factorization time further for HG21 and all other human genomes (data not shown), but for AT and yeast, there is no significant speedup from SUCC to POSI. It seems that for these not-so-similar datasets, prediction based on the previous RME (SUCC) is already sufficient.

Bottom Line: Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB).Based on our evaluation, we identify the best configurations for common use cases.Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

View Article: PubMed Central - PubMed

Affiliation: Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany.

ABSTRACT
The success of high-throughput sequencing has lead to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between input sequence and a reference sequence, gained lots of interest in this field. Highly-similar sequences, e.g., Human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that the compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for the referential compression against multiple references: The factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

No MeSH data available.