Compression of next-generation sequencing reads aided by highly efficient de novo assembly.

Jones DC, Ruzzo WL, Peng X, Katze MG - Nucleic Acids Res. (2012)

Bottom Line: A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.


Affiliation: Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA. dcjones@cs.washington.edu

ABSTRACT

Unlabelled: We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.
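
The memory reduction described above comes from replacing the exact k-mer hash table of a conventional de Bruijn graph assembler with a probabilistic counting structure, which tolerates occasional overcounts in exchange for a much smaller, fixed memory footprint. The Python sketch below illustrates the idea with a plain counting Bloom filter; all names and parameters are hypothetical, and Quip's actual data structure is a more refined Bloom-filter variant, not this code.

    import hashlib

    class CountingBloomFilter:
        """Approximate k-mer counter: counts are never under-reported,
        but hash collisions can inflate them. Memory is fixed up front,
        independent of how many distinct k-mers appear."""

        def __init__(self, n_cells=1 << 20, n_hashes=4):
            self.cells = [0] * n_cells
            self.n_cells = n_cells
            self.n_hashes = n_hashes

        def _indices(self, kmer):
            # Derive n_hashes independent positions by salting one hash.
            for i in range(self.n_hashes):
                h = hashlib.blake2b(kmer.encode(), digest_size=8,
                                    salt=i.to_bytes(8, "little"))
                yield int.from_bytes(h.digest(), "little") % self.n_cells

        def add(self, kmer):
            for idx in self._indices(kmer):
                self.cells[idx] += 1

        def count(self, kmer):
            # The minimum across hash positions is the standard estimate.
            return min(self.cells[idx] for idx in self._indices(kmer))

    def count_kmers(reads, k=25):
        cbf = CountingBloomFilter()
        for read in reads:
            for i in range(len(read) - k + 1):
                cbf.add(read[i:i + k])
        return cbf

An assembler can then greedily extend a seed read, at each step choosing the successor base whose k-mer has the highest approximate count, without ever materializing the full de Bruijn graph.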

Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.
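
To make the abstract's second step concrete (storing read sequences as positions within assembled contigs), the sketch below shows how such a positional encoding can round-trip losslessly: a read becomes a contig ID, an offset and a short mismatch list, while unassembled reads are stored literally. The brute-force search, the mismatch budget and every name here are illustrative assumptions, not Quip's internal format; a real implementation would seed candidate positions with k-mers and entropy-code the fields.

    from dataclasses import dataclass, field

    @dataclass
    class EncodedRead:
        contig_id: int
        offset: int                 # 0-based start within the contig
        length: int
        mismatches: list = field(default_factory=list)  # (pos, base)

    def encode_read(read, contigs, max_mismatches=3):
        # Exhaustive scan for clarity only; far too slow in practice.
        for cid, contig in enumerate(contigs):
            for off in range(len(contig) - len(read) + 1):
                window = contig[off:off + len(read)]
                diffs = [(i, b) for i, (a, b)
                         in enumerate(zip(window, read)) if a != b]
                if len(diffs) <= max_mismatches:
                    return EncodedRead(cid, off, len(read), diffs)
        return read  # fallback: keep the raw sequence

    def decode_read(enc, contigs):
        if isinstance(enc, str):
            return enc
        bases = list(contigs[enc.contig_id][enc.offset:enc.offset + enc.length])
        for pos, base in enc.mismatches:
            bases[pos] = base
        return "".join(bases)

Because decode_read reverses encode_read exactly, the scheme stays lossless even for reads that disagree with the contigs they map to.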


Figure 1 (gks754-F1): One lane of sequencing data from each of six publicly available data sets (Table 2) was compressed with a variety of methods (Table 1). Compressed size is plotted as a proportion of uncompressed size. All methods evaluated are entirely lossless except Cramtools (marked with an asterisk), which does not preserve read IDs, giving it some advantage in these comparisons. Reference-based compression methods, which require an external sequence database, were not applied to the metagenomic data set for lack of an obvious reference; these entries are marked ‘N/A’.

Mentions: With each method, we compressed and then decompressed each data set on a server with a 2.8 GHz Intel Xeon processor and 64 GB of memory. In addition to recording the compressed file size (Figure 1), we measured wall-clock run time (Figure 3) and maximum memory usage (Table 3) using the Unix time command. Run time was normalized to the size of the original FASTQ file, giving throughput in megabytes per second.
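
The normalization just described (wall-clock time divided into the original file size) is straightforward to reproduce. A minimal Python sketch, assuming the compressor can be invoked as a subprocess; the example command is illustrative and its flags are not verified against quip's CLI:

    import os
    import subprocess
    import time

    def throughput_mb_per_s(cmd, fastq_path):
        """Run a compression command and report throughput relative
        to the size of the original (uncompressed) FASTQ input."""
        size_mb = os.path.getsize(fastq_path) / 1e6
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        elapsed = time.perf_counter() - start   # wall-clock seconds
        return size_mb / elapsed

    # Hypothetical usage:
    # throughput_mb_per_s(["quip", "reads.fastq"], "reads.fastq")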

