Limits...
Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus

Results for retrieval of offset array values. Space and time usage for retrieving offset array values under various compression schemes. Offsets are obtained from 15-mer hash tables for the fly (dm5), chicken (gg4), and human (hg19) genomes. Graphs show the retrieval time in nanoseconds per query as a function of the space required in bytes for each format. a Retrieval results for a single value. b Retrieval results for two adjacent values. Methods tested: storage as 32-bit integers (Int vector), or compressed using Elias gamma coding (Gamma), Elias delta (Delta), Fibonacci, BP64-vertical format, or the BP64-columnar and BP32-columnar schemes introduced here. For retrieving two adjacent values using vectorized bitpacking formats, results are shown for both two-pass (slower) and one-pass (faster) implementations
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4835851&req=5

Fig6: Results for retrieval of offset array values. Space and time usage for retrieving offset array values under various compression schemes. Offsets are obtained from 15-mer hash tables for the fly (dm5), chicken (gg4), and human (hg19) genomes. Graphs show the retrieval time in nanoseconds per query as a function of the space required in bytes for each format. a Retrieval results for a single value. b Retrieval results for two adjacent values. Methods tested: storage as 32-bit integers (Int vector), or compressed using Elias gamma coding (Gamma), Elias delta (Delta), Fibonacci, BP64-vertical format, or the BP64-columnar and BP32-columnar schemes introduced here. For retrieving two adjacent values using vectorized bitpacking formats, results are shown for both two-pass (slower) and one-pass (faster) implementations

Mentions: The results for retrieving a single offset value are shown in Fig. 6a. These results show that retrieval time is largely independent of genome size, which derives from the fact that the offsets file length depends instead on the k-mer size. The uncompressed format gives the fastest times at 12 ns/query, but requires 4 GB of space ( entries with 4 bytes required per entry). The Elias gamma, Elias delta, and Fibonacci formats produce compact representations that are 6–11 % of the uncompressed format, but generally have the slowest retrieval times, with over 130 ns/query for the fly genome, over 200 ns/query for the chicken genome, and 228–237 ns/query for human genomes. The BP64-vertical format requires slightly more space, at 8–14 % of the uncompressed format, but is 1.3–1.4 times faster than the SDSL methods, except for the fly genome, where the Elias gamma and delta methods are 10 % faster than the BP64-vertical method. The BP64-columnar format requires the same amount of space as BP64-vertical, but has retrieval times that are 2.7–3.0 times faster. The BP32-columnar format requires the most space among the bitpacking routines we tested, at 13–19 % of the uncompressed format (or 38–60 % more space than BP64-columnar), with times that are 12–15 % faster than BP64-columnar.Fig. 6


Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Results for retrieval of offset array values. Space and time usage for retrieving offset array values under various compression schemes. Offsets are obtained from 15-mer hash tables for the fly (dm5), chicken (gg4), and human (hg19) genomes. Graphs show the retrieval time in nanoseconds per query as a function of the space required in bytes for each format. a Retrieval results for a single value. b Retrieval results for two adjacent values. Methods tested: storage as 32-bit integers (Int vector), or compressed using Elias gamma coding (Gamma), Elias delta (Delta), Fibonacci, BP64-vertical format, or the BP64-columnar and BP32-columnar schemes introduced here. For retrieving two adjacent values using vectorized bitpacking formats, results are shown for both two-pass (slower) and one-pass (faster) implementations
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4835851&req=5

Fig6: Results for retrieval of offset array values. Space and time usage for retrieving offset array values under various compression schemes. Offsets are obtained from 15-mer hash tables for the fly (dm5), chicken (gg4), and human (hg19) genomes. Graphs show the retrieval time in nanoseconds per query as a function of the space required in bytes for each format. a Retrieval results for a single value. b Retrieval results for two adjacent values. Methods tested: storage as 32-bit integers (Int vector), or compressed using Elias gamma coding (Gamma), Elias delta (Delta), Fibonacci, BP64-vertical format, or the BP64-columnar and BP32-columnar schemes introduced here. For retrieving two adjacent values using vectorized bitpacking formats, results are shown for both two-pass (slower) and one-pass (faster) implementations
Mentions: The results for retrieving a single offset value are shown in Fig. 6a. These results show that retrieval time is largely independent of genome size, which derives from the fact that the offsets file length depends instead on the k-mer size. The uncompressed format gives the fastest times at 12 ns/query, but requires 4 GB of space ( entries with 4 bytes required per entry). The Elias gamma, Elias delta, and Fibonacci formats produce compact representations that are 6–11 % of the uncompressed format, but generally have the slowest retrieval times, with over 130 ns/query for the fly genome, over 200 ns/query for the chicken genome, and 228–237 ns/query for human genomes. The BP64-vertical format requires slightly more space, at 8–14 % of the uncompressed format, but is 1.3–1.4 times faster than the SDSL methods, except for the fly genome, where the Elias gamma and delta methods are 10 % faster than the BP64-vertical method. The BP64-columnar format requires the same amount of space as BP64-vertical, but has retrieval times that are 2.7–3.0 times faster. The BP32-columnar format requires the most space among the bitpacking routines we tested, at 13–19 % of the uncompressed format (or 38–60 % more space than BP64-columnar), with times that are 12–15 % faster than BP64-columnar.Fig. 6

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus