Sequence Factorization with Multiple References.

Wandelt S, Leser U - PLoS ONE (2015)

Bottom Line: Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.


Affiliation: Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany.

ABSTRACT
The success of high-throughput sequencing has led to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between an input sequence and a reference sequence, have gained considerable interest in this field. Highly similar sequences, e.g., human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.



pone.0139000.g009: Ranking of the Top-5 and worst method for four scenarios: 1) equal weighting factors, 2) preference on factorization speed, 3) preference on small factorizations, and 4) preference on low memory footprint. We show the original criteria values for three datasets (AT5, HG21, yeast).

Mentions: Our evaluation gives an overview of the wide range of performance across 30 factorization techniques against multiple references. If a user has to select a method for factorization, however, it is not clear which one is actually best. The selection of a good factorization method is difficult and depends on multiple criteria: expected factorization speed, optimality guarantees, and required main memory during factorization. Therefore, the selection of a factorization technique is a typical multi-criteria decision analysis (MCDA) problem. We solve the following MCDA problem: given up to 10 references and preference weights on factorization size, factorization speed, and used main memory, which method is appropriate and how many references should be used? We used TOPSIS [47, 48], a common MCDA technique, to find solutions for the following four scenarios: 1) overall best solution, 2) preference on small factorizations, 3) preference on fast factorization, and 4) preference on a small memory footprint. We analyzed 300 configurations by associating the number of references (1 to 10) with each of the 30 standard configurations; for instance, CST_LOMA_N_2 is CST_LOMA_N with two references. For each of the 300 configurations we recorded the number of RMEs, the factorization speed (in MB/s), and the memory footprint (in MB) for each of the six datasets. The result is shown in Fig 9; for brevity we show the ranks and values for AT5, HG21, and yeast only.
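The ranking step described above can be illustrated with a minimal sketch of the textbook TOPSIS procedure. This is not the authors' implementation: the function name `topsis`, the configuration names, the criteria values, and the weights are all hypothetical and chosen only for illustration. The sketch assumes that the number of RMEs and the memory footprint are cost criteria (smaller is better) and that factorization speed is a benefit criterion (larger is better).

```python
import math

def topsis(matrix, weights, benefit):
    """Textbook TOPSIS: return one closeness score in [0, 1] per row.

    matrix  -- rows are alternatives, columns are criteria values
    weights -- one preference weight per criterion (normalized internally)
    benefit -- per criterion: True if larger is better, False if smaller is better
    """
    n_crit = len(weights)
    total = sum(weights)
    w = [x / total for x in weights]
    # Vector-normalize each column so that criteria with different units compare.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal and anti-ideal points, taking the criterion direction into account.
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom else 0.0)
    return scores

# Hypothetical configurations: (number of RMEs, speed in MB/s, memory in MB).
# These values are made up for illustration and are not taken from the paper.
configs = {
    "cfg_small": (1000, 5.0, 8000),
    "cfg_fast": (3000, 400.0, 2000),
    "cfg_lean": (2500, 50.0, 100),
}
benefit = [False, True, False]      # only speed is a benefit criterion
weights = [1.0, 3.0, 1.0]           # hypothetical "preference on speed" scenario

scores = topsis(list(configs.values()), weights, benefit)
ranking = sorted(zip(configs, scores), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Changing `weights` corresponds to switching between scenarios such as equal weighting or a preference on small factorizations; the actual weights used in the paper are not stated in this excerpt.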

