ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences.

Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W - Nucleic Acids Res. (2009)

Bottom Line: Recent metagenomics studies of environmental samples suggested that microbial communities are much more diverse than previously reported, and deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. We developed two versions of ESPRIT, one for personal computers (PCs) and one for computer clusters (CCs).


Affiliation: Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL 32610-3622, USA. sunyijun@biotech.ufl.edu

ABSTRACT
Recent metagenomics studies of environmental samples suggested that microbial communities are much more diverse than previously reported, and deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. However, computational methods for analyzing large collections of 16S ribosomal sequences are limited. We proposed a new algorithm, referred to as ESPRIT, which addresses several computational issues with prior methods. We developed two versions of ESPRIT, one for personal computers (PCs) and one for computer clusters (CCs). The PC version is used for small- and medium-scale data sets and can process several tens of thousands of sequences within a few minutes, while the CC version is for large-scale problems and is able to analyze several hundreds of thousands of reads within one day. Large-scale experiments are presented that clearly demonstrate the effectiveness of the newly proposed algorithm. The source code and user guide are freely available at http://www.biotech.ufl.edu/people/sun/esprit.html.


Figure 1: (a) k-mer distances are highly correlated with genetic distances. (b) Removing sequence pairs with k-mer distances larger than the default threshold has a negligible impact on the estimation accuracy for the distance levels of interest. The experiment was performed on the 53R seawater sample (see Results Section).

Mentions: The naive application of the Needleman–Wunsch algorithm to an environmental sample, however, is computationally too expensive. For example, given one full run of 454 data with >400K reads, it is estimated that aligning all 80 billion pairs of sequences would take about 3 years, and that storing the resulting distance matrix in the PHYLIP format would need about 1000 GB. However, for the purpose of estimating biodiversity, only the sequence pairs with distances <0.10 are of interest, and these account for only a small fraction of all possible pairs (about 1–5%, Figure 1S). Restricting the alignment to these pairs means a possible 20- to 100-fold speedup. Moreover, once the unwanted pairs are removed, the distance information can be stored in a sparse matrix that requires much less memory.

To rapidly identify the sequence pairs of interest, we compute the k-mer distance of each pair of sequences. A k-mer, also known as a k-tuple, is a sub-sequence of k consecutive nucleotides. Specifying the value of k defines a complete alphabet Ω of k-mers. The number of occurrences of each k-mer in Ω, also called the genomics profile (18,19), is computed for each sequence. Then, the k-mer distance of a pair of sequences is derived as

$$d = 1 - \frac{\sum_{i=1}^{|\Omega|} \min\bigl(f_1(i), f_2(i)\bigr)}{\min(L_1, L_2) - k + 1} \qquad (1)$$

where f1(i) and f2(i) are the numbers of occurrences of the i-th k-mer in the two sequences, and L1 and L2 are the lengths of the two sequences. It has been shown that k-mer distances are highly correlated with genetic distances (20). The concept of k-mer counting has been used for various applications, including sequence alignment (10), phylogenetic analysis (21,22) and detection of horizontal gene transfer (23). We used this technique to remove unwanted sequence pairs, and found that with a k-mer distance threshold of 0.4, most of the sequence pairs of interest can be rapidly identified (Figure 1a). The default threshold in ESPRIT is 0.5.
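The k-mer filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not ESPRIT's actual code: it assumes the distance takes the form 1 − Σ min(f1(i), f2(i)) / (min(L1, L2) − k + 1), and the function names, the choice k = 6 and the dictionary used as a sparse matrix are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq (the per-sequence profile)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(seq1, seq2, k=6):
    """k-mer distance: 1 - shared k-mers / k-mers in the shorter sequence.

    k=6 is an illustrative value, not a documented ESPRIT default.
    """
    denom = min(len(seq1), len(seq2)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must contain at least k nucleotides")
    # Counter intersection keeps min(f1(i), f2(i)) for every k-mer i.
    shared = sum((kmer_profile(seq1, k) & kmer_profile(seq2, k)).values())
    return 1.0 - shared / denom


def candidate_pairs(seqs, k=6, threshold=0.5):
    """Return {(i, j): distance} only for pairs at or below the threshold.

    The dictionary plays the role of a sparse matrix: pairs above the
    threshold are never stored and would never be passed on to alignment.
    """
    kept = {}
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        d = kmer_distance(a, b, k)
        if d <= threshold:
            kept[(i, j)] = d
    return kept
```

Only the pairs returned by `candidate_pairs` would then go through the expensive Needleman–Wunsch alignment. The sketch still compares every pair once, which is acceptable because counting k-mers is far cheaper than aligning.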
Figure 1b shows that removing sequence pairs with k-mer distances larger than the default threshold has a negligible impact on the estimation accuracy for distance levels <0.1. Although the default k-mer threshold works well for all 10 data sets tested in this article, the appropriate value may vary between applications. Hence, we provide an auxiliary program in the software package that allows users to determine the threshold by comparing k-mer distances against genetic distances on a small subset of their samples.
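One way such a calibration could work is sketched below in Python. It is an assumption-laden illustration and does not reproduce the auxiliary program shipped with ESPRIT: the genetic distance is taken here as the fraction of non-matching columns in a Needleman–Wunsch alignment with simple linear gap scoring, and `suggest_threshold`, its `keep` parameter and the scoring values are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_distance(a, b, k):
    """1 - shared k-mers / number of k-mers in the shorter sequence."""
    pa = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    pb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return 1.0 - sum((pa & pb).values()) / (min(len(a), len(b)) - k + 1)


def nw_distance(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; return the fraction of columns that are not matches."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting columns and exact matches.
    i, j, columns, matches = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return 1.0 - matches / columns


def suggest_threshold(seqs, k=6, genetic_cutoff=0.10, keep=0.99):
    """Smallest k-mer threshold that retains a fraction `keep` of the pairs
    whose genetic distance is below genetic_cutoff; None if there are none."""
    close = sorted(kmer_distance(a, b, k)
                   for a, b in combinations(seqs, 2)
                   if nw_distance(a, b) < genetic_cutoff)
    if not close:
        return None
    return close[min(len(close) - 1, math.ceil(keep * len(close)) - 1)]
```

Run on a small random subset of reads, the returned value indicates how high the k-mer threshold must be so that very few pairs of interest are discarded. Comparing it with the default of 0.5 shows whether the default is safe for a given data set.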

