Limits...
Exploration and retrieval of whole-metagenome sequencing samples.

Seth S, Välimäki N, Kaski S, Honkela A - Bioinformatics (2014)

Bottom Line: Over the recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth owing to the high-throughput sequencing technologies that allow sequencing genomic samples cheaper, faster and with better coverage than before.We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples.We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as human microbiome project metagenomic samples.

View Article: PubMed Central - PubMed

Affiliation: Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland.

Show MeSH

Related in: MedlinePlus

Technical overview of our DSM framework consisting of client (left) and server (right) processes. The client-side processes are responsible for computing the substring frequencies within each sample  separately. Substrings and their frequencies are found using a depth-first-traversal over a (compressed) suffix tree. Frequency information is transmitted over to the server-side by streaming it as a balanced-parenthesis representation of a sorted trie. For example, the trie on the left results as the parenthesis representation given in the middle. The server reads the client-streams and merges the (already sorted) tries in recursive manner: at each node, the server computes the entropy based on the received values and updates the affected pairwise distances. Load balancing on the server-side is achieved by hashing the prefix of the substring so that each server corresponds to a certain range of hash values
© Copyright Policy - creative-commons
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4230234&req=5

btu340-F3: Technical overview of our DSM framework consisting of client (left) and server (right) processes. The client-side processes are responsible for computing the substring frequencies within each sample separately. Substrings and their frequencies are found using a depth-first-traversal over a (compressed) suffix tree. Frequency information is transmitted over to the server-side by streaming it as a balanced-parenthesis representation of a sorted trie. For example, the trie on the left results as the parenthesis representation given in the middle. The server reads the client-streams and merges the (already sorted) tries in recursive manner: at each node, the server computes the entropy based on the received values and updates the affected pairwise distances. Load balancing on the server-side is achieved by hashing the prefix of the substring so that each server corresponds to a certain range of hash values

Mentions: The DSM framework is based on a client-server model. The clients have one-to-one correspondence to the given samples, each client being responsible for computing the frequencies within the designated sample. The client-side computation relies heavily on suffix sorting techniques and on space-efficient data structures for strings (Välimäki and Puglisi, 2012): the input data are first preprocessed into a compressed representation, which replaces the input data and acts as an efficient search structure. The server-side computation is more straightforward: the server simply merges the (sorted) input from the clients, computes the entropies and updates the distance matrices. Figure 3 gives a toy example of the client–server interaction. Two crucial observations are needed to keep the whole computation and transmission costs feasible. First, the informative k-mers can be seen as a subset of left–right-branching substrings, i.e. substrings whose instances have differentiating continuation on both left and right. More formally: substring w of string T[1,n] is called right-branching if there exists two symbols a and b such that and both wa and wb are substrings of T. Similarly, a substring w is left-branching if aw and , are substrings of T. If a substring is both left-branching and right-branching, we say it is left–right-branching. Second, for any string of length n, there are at most O(n) left–right-branching substrings, and the total length of all such substrings is bounded by (Kärkkäinen et al., 2009, Theorem 1).Fig. 3.


Exploration and retrieval of whole-metagenome sequencing samples.

Seth S, Välimäki N, Kaski S, Honkela A - Bioinformatics (2014)

Technical overview of our DSM framework consisting of client (left) and server (right) processes. The client-side processes are responsible for computing the substring frequencies within each sample  separately. Substrings and their frequencies are found using a depth-first-traversal over a (compressed) suffix tree. Frequency information is transmitted over to the server-side by streaming it as a balanced-parenthesis representation of a sorted trie. For example, the trie on the left results as the parenthesis representation given in the middle. The server reads the client-streams and merges the (already sorted) tries in recursive manner: at each node, the server computes the entropy based on the received values and updates the affected pairwise distances. Load balancing on the server-side is achieved by hashing the prefix of the substring so that each server corresponds to a certain range of hash values
© Copyright Policy - creative-commons
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4230234&req=5

btu340-F3: Technical overview of our DSM framework consisting of client (left) and server (right) processes. The client-side processes are responsible for computing the substring frequencies within each sample separately. Substrings and their frequencies are found using a depth-first-traversal over a (compressed) suffix tree. Frequency information is transmitted over to the server-side by streaming it as a balanced-parenthesis representation of a sorted trie. For example, the trie on the left results as the parenthesis representation given in the middle. The server reads the client-streams and merges the (already sorted) tries in recursive manner: at each node, the server computes the entropy based on the received values and updates the affected pairwise distances. Load balancing on the server-side is achieved by hashing the prefix of the substring so that each server corresponds to a certain range of hash values
Mentions: The DSM framework is based on a client-server model. The clients have one-to-one correspondence to the given samples, each client being responsible for computing the frequencies within the designated sample. The client-side computation relies heavily on suffix sorting techniques and on space-efficient data structures for strings (Välimäki and Puglisi, 2012): the input data are first preprocessed into a compressed representation, which replaces the input data and acts as an efficient search structure. The server-side computation is more straightforward: the server simply merges the (sorted) input from the clients, computes the entropies and updates the distance matrices. Figure 3 gives a toy example of the client–server interaction. Two crucial observations are needed to keep the whole computation and transmission costs feasible. First, the informative k-mers can be seen as a subset of left–right-branching substrings, i.e. substrings whose instances have differentiating continuation on both left and right. More formally: substring w of string T[1,n] is called right-branching if there exists two symbols a and b such that and both wa and wb are substrings of T. Similarly, a substring w is left-branching if aw and , are substrings of T. If a substring is both left-branching and right-branching, we say it is left–right-branching. Second, for any string of length n, there are at most O(n) left–right-branching substrings, and the total length of all such substrings is bounded by (Kärkkäinen et al., 2009, Theorem 1).Fig. 3.

Bottom Line: Over the recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth owing to the high-throughput sequencing technologies that allow sequencing genomic samples cheaper, faster and with better coverage than before.We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples.We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as human microbiome project metagenomic samples.

View Article: PubMed Central - PubMed

Affiliation: Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland.

Show MeSH
Related in: MedlinePlus