Limits...
Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus

Packing layouts for block-based differential coding. Each layout encodes a block of 64 difference values using 6 bits for each value. Values are placed into three 128-bit registers, each containing four 32-bit words. Each small box shows a 6-bit quantity, labeled with an integer r that indicates the place for storing the difference value . Some 6-bit quantities are spread over two words, as shown by dashed lines. Blocks of color indicate parallel processing of the difference values needed to decode the value . aHorizontal layout, with values stored in index order. bVertical layout, with values striped across each set of words, as used in BP128. c Columnar layout, unidirectional scheme, with indices at increments of 4 striped across each set of words. d Columnar layout, bidirectional scheme
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4835851&req=5

Fig2: Packing layouts for block-based differential coding. Each layout encodes a block of 64 difference values using 6 bits for each value. Values are placed into three 128-bit registers, each containing four 32-bit words. Each small box shows a 6-bit quantity, labeled with an integer r that indicates the place for storing the difference value . Some 6-bit quantities are spread over two words, as shown by dashed lines. Blocks of color indicate parallel processing of the difference values needed to decode the value . aHorizontal layout, with values stored in index order. bVertical layout, with values striped across each set of words, as used in BP128. c Columnar layout, unidirectional scheme, with indices at increments of 4 striped across each set of words. d Columnar layout, bidirectional scheme

Mentions: When the bit width for a block is less than 32, more than one integer can be stored in each 32-bit word. The way in which integers are arranged in words can be considered a layout. Two layouts have been proposed so far for vectorized bitpacking. In a horizontal layout, adjacent integers are packed next to each other [14]. Figure 2a shows the horizontal layout for a bit width of 6, where each row represents a 128-bit vector of four 32-bit words that are accessed in parallel. Integers that are processed in parallel are shaded with the same color. In the decoding process, this layout requires a separate shift operation for each integer in the word. Although SIMD operations exist for performing distinct shifts for the words in a register, they entail a fair amount of complexity in the decoding algorithm.Fig. 2


Bitpacking techniques for indexing genomes: I. Hash tables.

Wu TD - Algorithms Mol Biol (2016)

Packing layouts for block-based differential coding. Each layout encodes a block of 64 difference values using 6 bits for each value. Values are placed into three 128-bit registers, each containing four 32-bit words. Each small box shows a 6-bit quantity, labeled with an integer r that indicates the place for storing the difference value . Some 6-bit quantities are spread over two words, as shown by dashed lines. Blocks of color indicate parallel processing of the difference values needed to decode the value . aHorizontal layout, with values stored in index order. bVertical layout, with values striped across each set of words, as used in BP128. c Columnar layout, unidirectional scheme, with indices at increments of 4 striped across each set of words. d Columnar layout, bidirectional scheme
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4835851&req=5

Fig2: Packing layouts for block-based differential coding. Each layout encodes a block of 64 difference values using 6 bits for each value. Values are placed into three 128-bit registers, each containing four 32-bit words. Each small box shows a 6-bit quantity, labeled with an integer r that indicates the place for storing the difference value . Some 6-bit quantities are spread over two words, as shown by dashed lines. Blocks of color indicate parallel processing of the difference values needed to decode the value . aHorizontal layout, with values stored in index order. bVertical layout, with values striped across each set of words, as used in BP128. c Columnar layout, unidirectional scheme, with indices at increments of 4 striped across each set of words. d Columnar layout, bidirectional scheme
Mentions: When the bit width for a block is less than 32, more than one integer can be stored in each 32-bit word. The way in which integers are arranged in words can be considered a layout. Two layouts have been proposed so far for vectorized bitpacking. In a horizontal layout, adjacent integers are packed next to each other [14]. Figure 2a shows the horizontal layout for a bit width of 6, where each row represents a 128-bit vector of four 32-bit words that are accessed in parallel. Integers that are processed in parallel are shaded with the same color. In the decoding process, this layout requires a separate shift operation for each integer in the word. Although SIMD operations exist for performing distinct shifts for the words in a register, they entail a fair amount of complexity in the decoding algorithm.Fig. 2

Bottom Line: We propose to compress the offset array using vectorized bitpacking.Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.It also has potential applications to other domains requiring differential coding with random access.

View Article: PubMed Central - PubMed

Affiliation: Department of Bioinformatics and Computational Biology, Genentech, Inc., 1 DNA Way, 94080, South San Francisco, CA 94080 USA.

ABSTRACT

Background: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome.

Results: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values.

Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.

No MeSH data available.


Related in: MedlinePlus