On-Demand Indexing for Referential Compression of DNA Sequences.

Alves F, Cogo V, Wandelt S, Leser U, Bessani A - PLoS ONE (2015)

Bottom Line: Referential compression exploits the similarity between the DNA of organisms of the same or an evolutionarily close species to reduce the storage demands of genome sequences by up to 700 times. The general idea is to store in the compressed file only the differences between the to-be-compressed sequence and a well-known reference sequence. Our approach, called On-Demand Indexing (ODI), compresses human chromosomes five to ten times faster than other state-of-the-art tools (on average), while achieving similar compression ratios.


Affiliation: LaSIGE, University of Lisbon, Lisbon, Portugal.

ABSTRACT
The decreasing cost of genome sequencing is creating a demand for scalable storage and processing tools and techniques to deal with the large amounts of generated data. Referential compression is one of these techniques: it exploits the similarity between the DNA of organisms of the same or an evolutionarily close species to reduce the storage demands of genome sequences by up to 700 times. The general idea is to store in the compressed file only the differences between the to-be-compressed sequence and a well-known reference sequence. In this paper, we propose a method for improving the performance of referential compression by removing the most costly phase of the process, the complete reference indexing. Our approach, called On-Demand Indexing (ODI), compresses human chromosomes five to ten times faster than other state-of-the-art tools (on average), while achieving similar compression ratios.
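The general idea of storing only the differences against a reference can be illustrated with a small sketch. This is not the encoding used by ODI or FRESCO, which the abstract does not specify. It is a minimal greedy scheme under assumed conventions: matches are stored as `(reference_start, length)` pairs, unmatched bases as literal strings, and the names `compress`, `decompress`, and `min_match` are hypothetical.

```python
from typing import List, Tuple, Union

# An entry is either a (reference_start, length) match or a literal string.
Entry = Union[Tuple[int, int], str]


def compress(reference: str, target: str, min_match: int = 4) -> List[Entry]:
    """Greedily encode `target` as matches against `reference` plus literals.

    Illustrative only: the seed lookup uses a naive substring search,
    not an index, so it is slow on real chromosomes.
    """
    entries: List[Entry] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        seed = target[i:i + min_match]
        pos = reference.find(seed) if len(seed) == min_match else -1
        if pos < 0:
            # No seed match: keep this base as raw (literal) data.
            literal.append(target[i])
            i += 1
            continue
        # Extend the seed match as far as both sequences agree.
        length = min_match
        while (i + length < len(target)
               and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        if literal:
            entries.append("".join(literal))
            literal = []
        entries.append((pos, length))
        i += length
    if literal:
        entries.append("".join(literal))
    return entries


def decompress(reference: str, entries: List[Entry]) -> str:
    """Rebuild the target from the reference and the stored entries."""
    parts = []
    for entry in entries:
        if isinstance(entry, str):
            parts.append(entry)
        else:
            start, length = entry
            parts.append(reference[start:start + length])
    return "".join(parts)
```

When the target differs from the reference by a single substitution, the output contains two long matches around one literal base, which is why highly similar genomes compress so well under this kind of scheme.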




pone.0132460.g002: FRESCO’s execution time separated into its internal steps. Time each compression step takes for each chromosome (in proportion).

Mentions: Fig 2 presents a composite analysis of FRESCO’s execution time for different chromosomes of a human genome (Section Results contains a description of our experimental environment). Each bar in this graph comprises the three previously described steps. The “Other” part is composed of several shorter steps, such as computing the encoding, file reading/writing, and variable initialization. This figure shows that most of FRESCO’s execution time is spent in the indexing phase. This phase takes more than 90% of the total execution time because constructing the K-mer table means adding to this structure every K-mer present in the reference chromosome sequence. For example, the largest human chromosome generates a K-mer table with more than 250 million entries. The compression itself accounts for only about 1.5% of the execution time, and the remaining tasks account for about 4%.
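To see why complete indexing dominates the running time, consider a sketch of an eager K-mer table. It assumes a plain dictionary that maps each K-mer to the list of positions where it starts; FRESCO’s actual data structure is not described here, and the names `build_kmer_table` and `count_indexed_positions` are hypothetical. A reference of length n contributes n - K + 1 positions, so the work grows with the full chromosome length regardless of how few K-mers the compression step later looks up.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_table(reference: str, k: int) -> Dict[str, List[int]]:
    """Index every K-mer of the reference (the eager, complete indexing).

    Each of the len(reference) - k + 1 start positions is visited once
    and stored, whether or not it is ever queried afterwards.
    """
    table: Dict[str, List[int]] = defaultdict(list)
    for start in range(len(reference) - k + 1):
        table[reference[start:start + k]].append(start)
    return dict(table)


def count_indexed_positions(table: Dict[str, List[int]]) -> int:
    """Total number of positions stored across all K-mers."""
    return sum(len(positions) for positions in table.values())
```

For a toy reference such as `"ACGTACGTAC"` with `k = 4`, every one of the seven start positions ends up in the table. Scaled to a human chromosome, the same loop runs hundreds of millions of times before any compression begins.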

