Compression of next-generation sequencing reads aided by highly efficient de novo assembly.

Jones DC, Ruzzo WL, Peng X, Katze MG - Nucleic Acids Res. (2012)

Bottom Line: A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.


Affiliation: Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA. dcjones@cs.washington.edu

ABSTRACT

Unlabelled: We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.
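
To make the assembly-based approach concrete, here is a minimal sketch of the core idea: a read that occurs verbatim in an assembled contig is replaced by a compact (contig index, offset) reference, and any other read is stored literally. This is an illustration under simplifying assumptions (exact, forward-strand, fixed-length matches), not Quip's actual encoder; all names are hypothetical.

# Minimal sketch of storing reads as positions within assembled contigs.
# Assumes exact, forward-strand matches of fixed-length reads; Quip's
# real encoder is more involved and statistically compresses the
# references themselves.

def encode_reads(reads, contigs):
    """Encode each read as ('ref', contig_index, offset) if it occurs
    verbatim in some contig, otherwise as ('raw', sequence)."""
    encoded = []
    for read in reads:
        for i, contig in enumerate(contigs):
            offset = contig.find(read)
            if offset != -1:
                encoded.append(("ref", i, offset))
                break
        else:  # no contig contained the read
            encoded.append(("raw", read))
    return encoded

def decode_reads(encoded, contigs, read_length):
    """Invert encode_reads, recovering the original sequences losslessly."""
    reads = []
    for record in encoded:
        if record[0] == "ref":
            _, i, offset = record
            reads.append(contigs[i][offset:offset + read_length])
        else:
            reads.append(record[1])
    return reads

# Round trip on a toy contig: the first read is found at offset 4 and is
# stored as a reference; the second is absent and stored verbatim.
contigs = ["ACGTACGTTTGACCA"]
reads = ["ACGTTTGA", "GGGGCCCC"]
assert decode_reads(encode_reads(reads, contigs), contigs, 8) == reads

A (contig, offset) pair costs a few bytes regardless of read length, which is why reads that assemble into long contigs compress so well.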

Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.



Figure 4 (gks754-F4): The trade-off between memory usage and false positive rate in the dlCBF, evaluated by inserting 10 million unique 25-mers into tables of increasing size. Memory usage is reported as the proportion of the memory used by a memory-efficient hash table performing the same insertions.

Mentions: We find that with only 20% of the space of the hash table, the dlCBF accrues a collision rate of less than 0.001 (Figure 4). While the hash table performed the 10 million insertions in 7.34 s, the dlCBF required only 4.48 s on average, with both simulations run on a 2.8 GHz Intel Xeon processor. Table size did not greatly affect the run time of the dlCBF. The assembler implemented in Quip uses a fixed table size, conservatively set to allow for 100 million unique k-mers with a collision rate of ∼0.08% in simulations.
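
As an illustration of the trade-off described in Figure 4 and the paragraph above, the sketch below substitutes a plain Bloom filter for the dlCBF (the d-left counting Bloom filter used by Quip is a different, counting structure; this only demonstrates the principle that shrinking the table raises the false positive rate). The class and parameters here are assumptions for the sketch, not Quip's API, and 10,000 k-mers are used rather than the paper's 10 million, for speed.

import hashlib
import random

class BloomFilter:
    """Plain Bloom filter: a bit array probed at k positions per item."""
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive all probe positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Insert random unique 25-mers, then probe with 25-mers known to be
# absent: as bits per k-mer shrink, the collision rate grows.
random.seed(0)
kmers = {"".join(random.choice("ACGT") for _ in range(25)) for _ in range(10000)}
absent = {"".join(random.choice("ACGT") for _ in range(25)) for _ in range(10000)} - kmers
for bits_per_kmer in (16, 8, 4):
    bf = BloomFilter(bits_per_kmer * len(kmers), num_hashes=4)
    for k in kmers:
        bf.add(k)
    fp = sum(q in bf for q in absent) / len(absent)
    print(f"{bits_per_kmer:2d} bits per k-mer: false positive rate ~ {fp:.4f}")

The hash table baseline in the figure stores each 25-mer exactly; the probabilistic table trades a small, controlled collision rate for a large constant-factor saving in memory, which is what makes assembling millions of reads feasible.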

