Limits...
Streamlined Genome Sequence Compression using Distributed Source Coding.

Wang S, Jiang X, Chen F, Cui L, Cheng S - Cancer Inform (2014)

Bottom Line: Existing techniques that require heavy client (encoder side) cannot be applied.To tackle this challenge, we carefully examined distributed source coding theory and developed a customized reference-based genome compression protocol to meet the low-complexity need at the client side.Our experimental results showed promising performance of the proposed method when compared with the state-of-the-art algorithm (GRS).

View Article: PubMed Central - PubMed

Affiliation: Division of Biomedical Informatics, University of California, San Diego, La Jolla, CA, USA.

ABSTRACT
We aim at developing a streamlined genome sequence compression algorithm to support alternative miniaturized sequencing devices, which have limited communication, storage, and computation power. Existing techniques that require heavy client (encoder side) cannot be applied. To tackle this challenge, we carefully examined distributed source coding theory and developed a customized reference-based genome compression protocol to meet the low-complexity need at the client side. Based on the variation between source and reference, our protocol will pick adaptively either syndrome coding or hash coding to compress subsequences of changing code length. Our experimental results showed promising performance of the proposed method when compared with the state-of-the-art algorithm (GRS).

No MeSH data available.


Related in: MedlinePlus

The diagram of hash-based coding.
© Copyright Policy - open-access
Related In: Results  -  Collection


getmorefigures.php?uid=PMC4256044&req=5

f3-cin-suppl.1-2014-123: The diagram of hash-based coding.

Mentions: If a matched hash for k = j + U,⋯j + V is detected (ie, ), the next offset compensated start location of the sliding window can be updated as j = k + L (see Fig. 3). Moreover, we claim that will be identical to , if and are matched with each other, which is the fundamental assumption of our proposed system. Intuitively, the aforementioned assumption can be enforced by choosing a strong hash code with a small search region. The experimental results based on sequences22,23 with total more than 238 million bases demonstrate that a 16-bit cyclic redundancy check hash code with a search region U = −2 and V = 10 provides a strong assertion of such assumption. In addition, the decoder will inform the success to the encoder and request a longer code length based on a predefined protocol as updating Lcurrent = bL0, where L0 is a predefined initial length and the scaling factor b is updated as b = b + db, db is an incremental constant, and b is initialized as 0. For example, at the beginning, Lcurrent = L0, if a matched hash is detected, the adaptive length Lcurrent will be updated as Lcurrent = dbL0, as the scaling factor b = 0 + db. Similarly, if nh number of successively matched hashes are detected, the adaptive length and its corresponding scale factor will be Lcurrent = nhdbL0 and b = nhdb, respectively.


Streamlined Genome Sequence Compression using Distributed Source Coding.

Wang S, Jiang X, Chen F, Cui L, Cheng S - Cancer Inform (2014)

The diagram of hash-based coding.
© Copyright Policy - open-access
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC4256044&req=5

f3-cin-suppl.1-2014-123: The diagram of hash-based coding.
Mentions: If a matched hash for k = j + U,⋯j + V is detected (ie, ), the next offset compensated start location of the sliding window can be updated as j = k + L (see Fig. 3). Moreover, we claim that will be identical to , if and are matched with each other, which is the fundamental assumption of our proposed system. Intuitively, the aforementioned assumption can be enforced by choosing a strong hash code with a small search region. The experimental results based on sequences22,23 with total more than 238 million bases demonstrate that a 16-bit cyclic redundancy check hash code with a search region U = −2 and V = 10 provides a strong assertion of such assumption. In addition, the decoder will inform the success to the encoder and request a longer code length based on a predefined protocol as updating Lcurrent = bL0, where L0 is a predefined initial length and the scaling factor b is updated as b = b + db, db is an incremental constant, and b is initialized as 0. For example, at the beginning, Lcurrent = L0, if a matched hash is detected, the adaptive length Lcurrent will be updated as Lcurrent = dbL0, as the scaling factor b = 0 + db. Similarly, if nh number of successively matched hashes are detected, the adaptive length and its corresponding scale factor will be Lcurrent = nhdbL0 and b = nhdb, respectively.

Bottom Line: Existing techniques that require heavy client (encoder side) cannot be applied.To tackle this challenge, we carefully examined distributed source coding theory and developed a customized reference-based genome compression protocol to meet the low-complexity need at the client side.Our experimental results showed promising performance of the proposed method when compared with the state-of-the-art algorithm (GRS).

View Article: PubMed Central - PubMed

Affiliation: Division of Biomedical Informatics, University of California, San Diego, La Jolla, CA, USA.

ABSTRACT
We aim at developing a streamlined genome sequence compression algorithm to support alternative miniaturized sequencing devices, which have limited communication, storage, and computation power. Existing techniques that require heavy client (encoder side) cannot be applied. To tackle this challenge, we carefully examined distributed source coding theory and developed a customized reference-based genome compression protocol to meet the low-complexity need at the client side. Based on the variation between source and reference, our protocol will pick adaptively either syndrome coding or hash coding to compress subsequences of changing code length. Our experimental results showed promising performance of the proposed method when compared with the state-of-the-art algorithm (GRS).

No MeSH data available.


Related in: MedlinePlus