Exploration and retrieval of whole-metagenome sequencing samples.

Seth S, Välimäki N, Kaski S, Honkela A - Bioinformatics (2014)

Bottom Line: In recent years, the field of whole-metagenome shotgun sequencing has grown significantly, owing to high-throughput sequencing technologies that allow genomic samples to be sequenced more cheaply, faster and with better coverage than before. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as Human Microbiome Project metagenomic samples.

Affiliation: Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland.

Fig. 4. Retrieval performance comparison of the proposed approach using all k-mers (‘Ak’) against the following base measures: (1) ‘FIG’: retrieval performance using known protein families, (2) ‘Abd’: Bray–Curtis dissimilarity between relative estimated abundances, (3) ‘3’: distance between relative abundances of 3-mers. ‘Ak’ uses the ‘optimized metric’ over 101 equally spaced threshold values between 0 and 1. Each error bar shows the MAP value along with the standard error. The grey horizontal line shows retrieval by chance: MAP computed over a zero similarity metric. An arrow (if present) over a method indicates whether the performance of the corresponding method is significantly better (↑) or worse (↓) than ‘Ak’. The stars denote the significance level: 0 < *** < 0.001 < ** < 0.01 < * < 0.05. For the synthetic datasets (in the bottom row), the relative abundance is known from the experimental design; we present this result as ‘T’. For MetaHIT, we present the performance for both the Jaccard and the log metric, as the latter performs much better than the former.
© Copyright Policy - creative-commons

Mentions: Retrieval of samples with similar annotation: We applied the proposed approach and a number of alternatives to retrieve similar samples from the same dataset, and evaluated each method by how many of the retrieved samples had the same annotation: class label, disease state or body site. A comparison of the obtained MAP values, averaged over queries by all positive samples, is shown in Figure 4. The results show the performance achieved by the ‘optimized metric’. The alternatives we considered were: (i) retrieval based on the proposed distances between frequencies of 21-mers appearing in known protein families (FIGfams), with added pseudocounts but without entropy filtering (Meyer et al., 2009); (ii) retrieval based on Bray–Curtis dissimilarity between relative species abundances estimated using MetaPhlAn (Segata et al., 2012), where each feature is a species found in at least one sample and no regularization is added; and (iii) retrieval based on distances between relative frequencies of 3-mers (Jiang et al., 2012).
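The evaluation described above can be illustrated with a minimal sketch. It ranks all other samples by Bray–Curtis dissimilarity to a query and computes mean average precision (MAP) over queries by positive samples. The function names, the toy profiles and the labels are hypothetical, and the average-precision formula used here (precision averaged over the ranks of the relevant samples) is a common definition assumed for illustration; the exact variant used in the paper is not stated in this excerpt.

```python
from typing import Callable, Sequence


def bray_curtis(p: Sequence[float], q: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(p, q))
    denominator = sum(a + b for a, b in zip(p, q))
    # Two all-zero profiles are treated as identical (assumption).
    return numerator / denominator if denominator else 0.0


def average_precision(ranked_labels: Sequence[str], positive: str) -> float:
    """Mean of precision@rank taken at each rank where a positive sample appears."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(
    profiles: Sequence[Sequence[float]],
    labels: Sequence[str],
    positive: str,
    dissimilarity: Callable[[Sequence[float], Sequence[float]], float] = bray_curtis,
) -> float:
    """MAP over queries by all positive samples; the query itself is excluded."""
    scores = []
    for i, query in enumerate(profiles):
        if labels[i] != positive:
            continue
        others = [j for j in range(len(profiles)) if j != i]
        # Most similar (smallest dissimilarity) samples are retrieved first.
        others.sort(key=lambda j: dissimilarity(query, profiles[j]))
        scores.append(average_precision([labels[j] for j in others], positive))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical relative-abundance profiles over three species.
    toy_profiles = [
        [0.5, 0.5, 0.0],
        [0.6, 0.4, 0.0],
        [0.0, 0.1, 0.9],
        [0.1, 0.0, 0.9],
    ]
    toy_labels = ["case", "case", "control", "control"]
    print(mean_average_precision(toy_profiles, toy_labels, positive="case"))
```

Any other dissimilarity, such as a k-mer based distance, could be passed through the `dissimilarity` argument so that several measures are compared under the same ranking and scoring procedure.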

