On-Demand Indexing for Referential Compression of DNA Sequences.

Alves F, Cogo V, Wandelt S, Leser U, Bessani A - PLoS ONE (2015)

Bottom Line: Referential compression is one of these techniques, in which the similarity between the DNA of organisms of the same or an evolutionarily close species is exploited to reduce the storage demands of genome sequences by up to 700 times. The general idea is to store in the compressed file only the differences between the to-be-compressed sequence and a well-known reference sequence. Our approach, called On-Demand Indexing (ODI), compresses human chromosomes five to ten times faster than other state-of-the-art tools (on average), while achieving similar compression ratios.


Affiliation: LaSIGE, University of Lisbon, Lisbon, Portugal.

ABSTRACT
The decreasing cost of genome sequencing is creating a demand for scalable storage and processing tools and techniques to deal with the large amounts of generated data. Referential compression is one of these techniques, in which the similarity between the DNA of organisms of the same or an evolutionarily close species is exploited to reduce the storage demands of genome sequences by up to 700 times. The general idea is to store in the compressed file only the differences between the to-be-compressed sequence and a well-known reference sequence. In this paper, we propose a method for improving the performance of referential compression by removing the most costly phase of the process, the complete reference indexing. Our approach, called On-Demand Indexing (ODI), compresses human chromosomes five to ten times faster than other state-of-the-art tools (on average), while achieving similar compression ratios.
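
The general idea of referential compression can be illustrated with a toy sketch. The code below is not ODI and not the authors' implementation: it is a minimal greedy encoder, written under the assumption that a complete k-mer index of the reference is built up front (the costly phase that the abstract says ODI removes). The function names, the entry format, and the value of `K` are hypothetical choices made for illustration only.

```python
from collections import defaultdict

K = 4  # k-mer length; an illustrative choice, not a value from the paper


def build_index(reference, k=K):
    """Map every k-mer of the reference to the positions where it starts.

    This is the complete reference indexing step that a conventional
    referential compressor performs before it can compress anything.
    """
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def compress(target, reference, k=K):
    """Encode target as ('M', ref_pos, length) matches and ('L', base) literals."""
    index = build_index(reference, k)
    entries = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer of the target.
        for pos in index.get(target[i:i + k], ()):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            # Long enough match: store only a pointer into the reference.
            entries.append(("M", best_pos, best_len))
            i += best_len
        else:
            # No usable match: store the differing base itself.
            entries.append(("L", target[i]))
            i += 1
    return entries


def decompress(entries, reference):
    """Rebuild the target from the entries and the same reference."""
    parts = []
    for entry in entries:
        if entry[0] == "M":
            _, pos, length = entry
            parts.append(reference[pos:pos + length])
        else:
            parts.append(entry[1])
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    target = "ACGTACGATTGACCA"  # differs from the reference by one base
    entries = compress(target, reference)
    print(entries)
    print(decompress(entries, reference) == target)
```

In this sketch a target that is almost identical to the reference collapses into a few match entries plus the differing bases, which is why similar sequences compress well. How ODI avoids building the full index is described in the paper itself and is not reproduced here.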



pone.0132460.g009: Correlation between the number of known variations per chromosome and the compression ratio for that chromosome. There are two clear clusters with similar compression ratios, and chromosome X has significantly fewer variations, leading to a higher compression ratio.

Mentions: Additionally, the compression ratio of chromosome X deviates from the results for the other chromosomes. Genomic diversity is smaller in sex chromosomes than in autosomes [21], which explains this outlier. The more genomic variations a block of nucleotides has, the lower the resulting compression ratio. We obtained the genomic diversity of each chromosome by dividing its approximate size by the number of its known genomic variations available in dbSNP (http://www.ncbi.nlm.nih.gov/snp/, see also the S4 Table); a higher value of this variation ratio therefore means fewer known variations per nucleotide. Fig 9 correlates the genomic diversity and the compression ratio of each chromosome: there are two clusters of points near a variation ratio of 20, while chromosome X is isolated near a variation ratio of 33. Chromosome X has the fewest variations [21] per group of nucleotides, which results in the highest compression ratio. In contrast, yeast genomes, for example, can differ substantially, which results in low compression ratios even with referential compression algorithms [22].
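
The variation ratio described above is a simple division, sketched below. The function name is hypothetical, and the sizes and variation counts in the usage example are made-up placeholders chosen only to land on the ratios 20 and 33 mentioned in the text; they are not values from dbSNP or from the S4 Table.

```python
def variation_ratio(chromosome_size, known_variations):
    """Approximate chromosome size divided by its number of known variations.

    A larger result means fewer known variations per nucleotide.
    """
    if known_variations <= 0:
        raise ValueError("known_variations must be positive")
    return chromosome_size / known_variations


if __name__ == "__main__":
    # Placeholder inputs, not real chromosome data.
    print(variation_ratio(2_000_000, 100_000))
    print(variation_ratio(3_300_000, 100_000))
```

Under this definition, a chromosome with a ratio near 33 has fewer known variations per nucleotide than one with a ratio near 20, which is consistent with chromosome X compressing best.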

