Limits...
Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus

Decoding of columnar layout. An example is shown for decoding column 2 for quarter block Q2 from a block packed with a bit width of 6. Source code is listed in part (f), with key steps shown graphically in parts (a) through (e), and a final summation step in part (g). a, b Loading of two 128-bit vectors from the block. c Parallel (SIMD) right shift of each 32-bit word by 24 bits, to move , , , and  into the lowest 6 bits. d Recombining of , , , and , which are split between two 128-bit vectors in the block, using a parallel right shift of the first vector by 30 bits and a parallel left shift of the second vector by 2 bits. e Parallel addition of the first and second vector of differences. f Source code in the C language. Comments in the source correspond to the steps labeled a through e. g Difference results shown as an array of 32-bit words. The value  can be obtained by adding two terms from the array, while the value  can be obtained by adding four terms
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4835851&req=5

Fig5: Decoding of columnar layout. An example is shown for decoding column 2 for quarter block Q2 from a block packed with a bit width of 6. Source code is listed in part (f), with key steps shown graphically in parts (a) through (e), and a final summation step in part (g). a, b Loading of two 128-bit vectors from the block. c Parallel (SIMD) right shift of each 32-bit word by 24 bits, to move , , , and into the lowest 6 bits. d Recombining of , , , and , which are split between two 128-bit vectors in the block, using a parallel right shift of the first vector by 30 bits and a parallel left shift of the second vector by 2 bits. e Parallel addition of the first and second vector of differences. f Source code in the C language. Comments in the source correspond to the steps labeled a through e. g Difference results shown as an array of 32-bit words. The value can be obtained by adding two terms from the array, while the value can be obtained by adding four terms

Mentions: As an example of one of these 256 procedures, Fig. 5 shows a procedure for decoding column 2 for the Q2 quarter block from a block with a bit width of 6. The two decoding steps are also shown by the dashed boxes in Fig. 4b. The first decoding step yields four difference values, shown as words 0 through 3 in Fig. 5c, and the second decoding step yields another four difference values, shown in Fig. 5d. Our procedure then performs a final SIMD addition to add the first set of difference values to the second set, thereby yielding words 4 through 7 in Fig. 5e. Words 0 through 7 from Fig. 5c, e can be stored in an array, as shown in Fig. 5g.Fig. 5


Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Decoding of columnar layout. An example is shown for decoding column 2 for quarter block Q2 from a block packed with a bit width of 6. Source code is listed in part (f), with key steps shown graphically in parts (a) through (e), and a final summation step in part (g). a, b Loading of two 128-bit vectors from the block. c Parallel (SIMD) right shift of each 32-bit word by 24 bits, to move , , , and  into the lowest 6 bits. d Recombining of , , , and , which are split between two 128-bit vectors in the block, using a parallel right shift of the first vector by 30 bits and a parallel left shift of the second vector by 2 bits. e Parallel addition of the first and second vector of differences. f Source code in the C language. Comments in the source correspond to the steps labeled a through e. g Difference results shown as an array of 32-bit words. The value  can be obtained by adding two terms from the array, while the value  can be obtained by adding four terms
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4835851&req=5

Fig5: Decoding of columnar layout. An example is shown for decoding column 2 for quarter block Q2 from a block packed with a bit width of 6. Source code is listed in part (f), with key steps shown graphically in parts (a) through (e), and a final summation step in part (g). a, b Loading of two 128-bit vectors from the block. c Parallel (SIMD) right shift of each 32-bit word by 24 bits, to move , , , and into the lowest 6 bits. d Recombining of , , , and , which are split between two 128-bit vectors in the block, using a parallel right shift of the first vector by 30 bits and a parallel left shift of the second vector by 2 bits. e Parallel addition of the first and second vector of differences. f Source code in the C language. Comments in the source correspond to the steps labeled a through e. g Difference results shown as an array of 32-bit words. The value can be obtained by adding two terms from the array, while the value can be obtained by adding four terms
Mentions: As an example of one of these 256 procedures, Fig. 5 shows a procedure for decoding column 2 for the Q2 quarter block from a block with a bit width of 6. The two decoding steps are also shown by the dashed boxes in Fig. 4b. The first decoding step yields four difference values, shown as words 0 through 3 in Fig. 5c, and the second decoding step yields another four difference values, shown in Fig. 5d. Our procedure then performs a final SIMD addition to add the first set of difference values to the second set, thereby yielding words 4 through 7 in Fig. 5e. Words 0 through 7 from Fig. 5c, e can be stored in an array, as shown in Fig. 5g.Fig. 5

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus