Limits...
Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus

Hash table representation. a Standard representation using an offset array that indicates the start of genomic positions for a given k-mer. b Compressed representation where the offset array has been replaced by a bitstream and a metainformation array. The bitstream contains differences between offsets that have been compressed by bitpacking them into blocks of a given size. The metainformation array contains a pointer and a prefix sum for every block
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4835851&req=5

Fig1: Hash table representation. a Standard representation using an offset array that indicates the start of genomic positions for a given k-mer. b Compressed representation where the offset array has been replaced by a bitstream and a metainformation array. The bitstream contains differences between offsets that have been compressed by bitpacking them into blocks of a given size. The metainformation array contains a pointer and a prefix sum for every block

Mentions: In genomics, hash tables are typically implemented as a simple lookup table [8], in which an offset array contains pointers into a positions array, for the universe of possible k-mers (Fig. 1a). This straightforward table implementation is feasible because of the fixed and relatively small value of k needed for our domain. More general domains, having keys of arbitrary or relatively long length, require a hash function to compute a bucket index. Hash functions raise the possibility of collisions, where different keys map to the same bucket index, which then necessitate potentially complex and time-consuming procedures for handling such collisions.Fig. 1


Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Hash table representation. a Standard representation using an offset array that indicates the start of genomic positions for a given k-mer. b Compressed representation where the offset array has been replaced by a bitstream and a metainformation array. The bitstream contains differences between offsets that have been compressed by bitpacking them into blocks of a given size. The metainformation array contains a pointer and a prefix sum for every block
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4835851&req=5

Fig1: Hash table representation. a Standard representation using an offset array that indicates the start of genomic positions for a given k-mer. b Compressed representation where the offset array has been replaced by a bitstream and a metainformation array. The bitstream contains differences between offsets that have been compressed by bitpacking them into blocks of a given size. The metainformation array contains a pointer and a prefix sum for every block
Mentions: In genomics, hash tables are typically implemented as a simple lookup table [8], in which an offset array contains pointers into a positions array, for the universe of possible k-mers (Fig. 1a). This straightforward table implementation is feasible because of the fixed and relatively small value of k needed for our domain. More general domains, having keys of arbitrary or relatively long length, require a hash function to compute a bucket index. Hash functions raise the possibility of collisions, where different keys map to the same bucket index, which then necessitate potentially complex and time-consuming procedures for handling such collisions.Fig. 1

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus