CoMeta: classification of metagenomes using k-mers.

Kawulok J, Deorowicz S - PLoS ONE (2015)

Bottom Line: In CoMeta, we use an exact method for read classification based on short subsequences (k-mers) and a fast program for indexing large sets of k-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context.

Affiliation: Institute of Informatics, Silesian University of Technology, Gliwice, Poland.

ABSTRACT
The study of environmental samples has been developing rapidly. Characterizing the composition of an environment broadens our knowledge of the relationship between species composition and environmental conditions. An important step in determining the composition of a sample is to compare the extracted DNA fragments with sequences derived from known organisms. In this paper, we introduce an algorithm called CoMeta (Classification of metagenomes), which assigns a query read (a DNA fragment) to one of the groups previously prepared by the user. Typically, a group corresponds to a taxon at a given taxonomic rank (e.g., phylum, genus); however, the prepared groups may instead contain sequences having various functions. In CoMeta, we use an exact method for read classification based on short subsequences (k-mers) and a fast program for indexing large sets of k-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at https://github.com/jkawulok/cometa under a free GNU GPL 2 license.

pone.0121453.g003: An example of comparing the query read with the reference sequence.

Mentions: In the comparison step, the set of reads M is compared with all n k-mer databases that have been created beforehand. Each database Di is loaded into RAM. When comparing each read from M against group Gi (1 ≤ i ≤ n), the match score (ξ) is obtained by accumulating the similarity between the k-mers extracted from the read and those from the reference sequences in Gi. For a given read R, successive k-mers are obtained using a 1-base sliding window, and every k-mer of R is checked for occurrence in Di. For each j-th k-mer of R found in Di, the match score ξ is increased by ξj, the number of bases in the k-mer that have not yet contributed to the match score (i.e., ξj = k − oj, where oj is the number of overlapping bases between the j-th k-mer and the previous k-mer found in Di). Because of the 1-base sliding window, two consecutive read k-mers have k−1 overlapping bases, and the intention is to prevent the match score from increasing too much when both exist in Di. The number of overlapping bases between the p-th and q-th k-mers (p < q) is max(0, k − (q − p)). An example of how the match score is calculated is presented in Fig 3 for k = 5. For simplicity, we assume that group Gi contains only one reference sequence, S. In the “k-mers” column, the k-mers that occur in the query read are listed in order. The k-mers that are found in database Di are marked in bold (a sorted list of the k-mers from group Gi is shown in the left part of the figure). The final match score for the sample read is 12, and the match rate score (Ξ), which is the percentage ratio of the match score to the read length, is 85.7% (Ξ = 12/14⋅100%). For better illustration, the sequence matching is also shown at the top of the figure.
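
The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CoMeta implementation: a plain Python set stands in for the k-mer database Di, the function names (build_kmer_set, match_score, match_rate) are hypothetical, and the read and reference below are made-up sequences chosen to give the same numbers as the text (score 12, rate 85.7%). They are not the sequences shown in Fig 3.

```python
def build_kmer_set(references, k):
    """Collect every k-mer of the reference sequences (stand-in for database Di)."""
    db = set()
    for ref in references:
        for i in range(len(ref) - k + 1):
            db.add(ref[i:i + k])
    return db


def match_score(read, db, k):
    """Sum, over read k-mers found in db, the bases not yet counted."""
    score = 0
    prev = None  # start index of the previous read k-mer found in db
    for j in range(len(read) - k + 1):  # 1-base sliding window
        if read[j:j + k] in db:
            # Overlap with the previous matched k-mer: max(0, k - (q - p)).
            overlap = 0 if prev is None else max(0, k - (j - prev))
            score += k - overlap
            prev = j
    return score


def match_rate(read, db, k):
    """Match score as a percentage of the read length."""
    return 100.0 * match_score(read, db, k) / len(read)


# Hypothetical example: the read differs from the reference at one base
# near its end, so its last two 5-mers are absent from the database.
reference = "ACGTTGCAAGGCAA"
read = "ACGTTGCAAGGCTA"
db = build_kmer_set([reference], k=5)
print(match_score(read, db, 5))           # 12
print(round(match_rate(read, db, 5), 1))  # 85.7
```

Subtracting the overlap means the score equals the number of read bases covered by at least one matching k-mer, so a run of consecutive hits adds one base per step instead of k.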

