Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Bottom Line: We propose to compress the offset array using vectorized bitpacking. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. It also has potential applications to other domains requiring differential coding with random access.


Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080, USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.
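To make the idea of bitpacking concrete, here is a minimal scalar sketch (not the paper's vectorized implementation; the function names are invented for illustration) showing how small nonnegative values, such as the differences between consecutive offsets, can be stored in a fixed number of bits each rather than in full 32-bit words:

```python
def pack(values, width):
    """Pack nonnegative integers into one bitstream, `width` bits per value.

    Hypothetical scalar sketch: a real bitpacking library operates on
    fixed-size blocks with SIMD instructions, but the storage idea is the same.
    """
    bits = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << width), "value does not fit in the given width"
        bits |= v << (i * width)  # place each value in its own width-bit slot
    return bits


def unpack(bits, width, n):
    """Recover `n` values of `width` bits each from the packed bitstream."""
    mask = (1 << width) - 1
    return [(bits >> (i * width)) & mask for i in range(n)]
```

For example, the four 3-bit values `[3, 0, 7, 5]` occupy 12 bits packed instead of 128 bits as four 32-bit integers; `unpack(pack(vals, 3), 3, 4)` returns the original list.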

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.




Fig. 4: Difference schemes for vectorized differential coding. These schemes show the bitstream and the metainformation values for a given block, with pointers indicating the positions in the bitstream where this block and the next block begin, and with the prefix sums recorded for the two blocks. They show how the original ascending values x in a block can be converted to difference values d. (a) Unidirectional cyclic differences as used in BP128, shown here for a block size of 64. Each difference d_i involves the original values x_i and x_(i-4); an exception holds for the first row, where differences are taken relative to the prefix sum for the block. Dashed boxes indicate the first two processing steps for the vertical layout. (b) Bidirectional cyclic difference scheme, which matches the unidirectional scheme for the first half of the block, but computes differences relative to the prefix sum for the next block in the second half. Dashed boxes indicate the first two processing steps for column 2. In both parts, shaded regions correspond to the values needed to compute the circled values, requiring 3 SIMD loads for the unidirectional scheme and 2 loads for the bidirectional scheme. The colors correspond to those in Fig. 2c, d. Q1–Q4 indicate quarter blocks, annotated with the total number of SIMD loads required.

Mentions: In the BP128 algorithm, because four cumulative sums are maintained in parallel, it is not the differences between adjacent values (i.e., d_i = x_i - x_(i-1)) that we want for differential coding, but rather the differences between each value and the one four elements away (i.e., d_i = x_i - x_(i-4)). An exception holds for the first four differences in each block, which are computed for x_0, x_1, x_2, and x_3 relative to the prefix sum for the beginning of the block, which acts as the starting point for computing the four cumulative sums. The overall cyclic difference scheme for a block of 64 is shown in Fig. 4a.
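The unidirectional cyclic difference scheme described above can be sketched in scalar form as follows (a hypothetical illustration with invented function names, not the paper's SIMD code, which processes the four lanes in parallel):

```python
def cyclic_differences(block, prefix_sum, lanes=4):
    """Unidirectional cyclic differences over one block (scalar sketch).

    The first `lanes` values are differenced against the block's prefix sum;
    every later value is differenced against the value `lanes` positions back,
    so that `lanes` independent cumulative sums can be maintained in parallel.
    """
    diffs = []
    for i, x in enumerate(block):
        base = prefix_sum if i < lanes else block[i - lanes]
        diffs.append(x - base)
    return diffs


def reconstruct(diffs, prefix_sum, lanes=4):
    """Invert the scheme: each lane accumulates its own running sum."""
    values = []
    for i, d in enumerate(diffs):
        base = prefix_sum if i < lanes else values[i - lanes]
        values.append(base + d)
    return values
```

Because the input offsets are monotonically nondecreasing, every difference is nonnegative and typically small, which is what makes the subsequent fixed-width bitpacking effective; `reconstruct(cyclic_differences(block, p), p)` returns the original block.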

