Limits...
Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus

Decoding of vertical layout. An example is shown for the first two cycles of serial decoding of the vertical layout from a block packed with a bit width of 6. Shaded regions correspond to the values in Fig. 2b. Source code is shown in part (f), with key steps shown graphically in parts (a) through (e). a Loading of the first 128-bit vector from the block. b Masking of the first four difference values from the vector. c, d Shifting and masking of the second four difference values from the vector. e Parallel addition of the first and second vectors of difference values. f Source code in the C language. Comments in the source code correspond to the steps labeled a through e
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4835851&req=5

Fig3: Decoding of vertical layout. An example is shown for the first two cycles of serial decoding of the vertical layout from a block packed with a bit width of 6. Shaded regions correspond to the values in Fig. 2b. Source code is shown in part (f), with key steps shown graphically in parts (a) through (e). a Loading of the first 128-bit vector from the block. b Masking of the first four difference values from the vector. c, d Shifting and masking of the second four difference values from the vector. e Parallel addition of the first and second vectors of difference values. f Source code in the C language. Comments in the source code correspond to the steps labeled a through e

Mentions: In contrast, the vertical layout used in BP128 can be processed by SIMD operations in a more straightforward manner. This layout stripes integers in sets of four across each set of four words [11, 15, 16]. Figure 2b shows a vertical layout for a bit width of 6. Decoding in this layout loads four 32-bit words in parallel (Fig. 3a) and masks their low-order bits to obtain the first four values (Fig. 3b). Then, the next four values can be obtained by right-shifting the words in parallel by the uniform bit width for the block (Fig. 3c), and again masking their low-order bits (Fig. 3d). Since these values represent the differences between the original values, we use a SIMD operation to keep a running sum of the shifted-and-masked low-order bits (Fig. 3e). The remaining values in the block can be decoded using the same SIMD shift and mask steps, with a new set of four 32-bit words loaded whenever all of the currently loaded difference values have been processed. Source code for decoding a vertical layout is shown in Fig. 3f.Fig. 3


Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Decoding of vertical layout. An example is shown for the first two cycles of serial decoding of the vertical layout from a block packed with a bit width of 6. Shaded regions correspond to the values in Fig. 2b. Source code is shown in part (f), with key steps shown graphically in parts (a) through (e). a Loading of the first 128-bit vector from the block. b Masking of the first four difference values from the vector. c, d Shifting and masking of the second four difference values from the vector. e Parallel addition of the first and second vectors of difference values. f Source code in the C language. Comments in the source code correspond to the steps labeled a through e
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4835851&req=5

Fig3: Decoding of vertical layout. An example is shown for the first two cycles of serial decoding of the vertical layout from a block packed with a bit width of 6. Shaded regions correspond to the values in Fig. 2b. Source code is shown in part (f), with key steps shown graphically in parts (a) through (e). a Loading of the first 128-bit vector from the block. b Masking of the first four difference values from the vector. c, d Shifting and masking of the second four difference values from the vector. e Parallel addition of the first and second vectors of difference values. f Source code in the C language. Comments in the source code correspond to the steps labeled a through e
Mentions: In contrast, the vertical layout used in BP128 can be processed by SIMD operations in a more straightforward manner. This layout stripes integers in sets of four across each set of four words [11, 15, 16]. Figure 2b shows a vertical layout for a bit width of 6. Decoding in this layout loads four 32-bit words in parallel (Fig. 3a) and masks their low-order bits to obtain the first four values (Fig. 3b). Then, the next four values can be obtained by right-shifting the words in parallel by the uniform bit width for the block (Fig. 3c), and again masking their low-order bits (Fig. 3d). Since these values represent the differences between the original values, we use a SIMD operation to keep a running sum of the shifted-and-masked low-order bits (Fig. 3e). The remaining values in the block can be decoded using the same SIMD shift and mask steps, with a new set of four 32-bit words loaded whenever all of the currently loaded difference values have been processed. Source code for decoding a vertical layout is shown in Fig. 3f.Fig. 3

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus