Artificial and natural duplicates in pyrosequencing reads of metagenomic data.

Niu B, Fu L, Sun S, Li W - BMC Bioinformatics (2010)

Bottom Line: Because the duplicated reads observed in a pyrosequencing run also include natural (non-artificial) duplicates, simply removing all duplicates may cause underestimation of the abundance associated with natural duplicates. We compared the occurrences of the duplicates identified by our method with the natural duplicates generated by independent simulations. We provide a tool to estimate the number of natural duplicates according to user-defined sample types, so users can decide whether to retain or remove duplicates in their projects.

Affiliation: California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, California 92093, USA.

ABSTRACT

Background: Artificial duplicates from pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads were filtered out in many metagenomic projects. However, since the duplicated reads observed in a pyrosequencing run also include natural (non-artificial) duplicates, simply removing all duplicates may also cause underestimation of abundance associated with natural duplicates.

Results: We implemented a method for identifying exact and nearly identical duplicates in pyrosequencing reads. The method performs an all-against-all sequence comparison and clusters the duplicates into groups using an algorithm modified from our previous sequence clustering method, cd-hit. It can process a typical dataset in approximately 10 minutes and also provides a consensus sequence for each group of duplicates. We applied the method to the underlying raw reads of 39 genomic projects and 10 metagenomic projects that used pyrosequencing. We compared the occurrences of the duplicates identified by our method with the natural duplicates generated by independent simulations. We observed that duplicates, including both artificial and natural duplicates, make up 4-44% of reads. The number of natural duplicates correlates strongly with the sample's read density (number of reads divided by genome size). For high-complexity metagenomic samples lacking dominant species, natural duplicates make up only <1% of all duplicates. For some other samples, such as transcriptomic samples, the majority of the observed duplicates might be natural duplicates.

Conclusions: Our method is available from http://cd-hit.org as a downloadable program and a web server. It is important not only to identify the duplicates from metagenomic datasets but also to distinguish whether they are artificial or natural duplicates. We provide a tool to estimate the number of natural duplicates according to user-defined sample types, so users can decide whether to retain or remove duplicates in their projects.
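
The clustering idea in the Results can be illustrated with a minimal sketch. It is not the cd-hit-based program itself. It assumes, for illustration only, that a read counts as a duplicate when it matches the beginning of a longer representative read with at most a small fraction of substitutions. Insertions and deletions are ignored, and the function names and the 4% threshold are hypothetical. Read density is computed as defined in the abstract (number of reads divided by genome size).

```python
from typing import List


def mismatches(a: str, b: str) -> int:
    """Count differing positions over the length of the shorter string."""
    return sum(x != y for x, y in zip(a, b))


def cluster_duplicates(reads: List[str],
                       max_mismatch_frac: float = 0.04) -> List[List[int]]:
    """Greedy all-against-representative clustering (simplified assumption).

    Reads are visited from longest to shortest. Each read joins the first
    cluster whose representative it matches from position 0 within the
    allowed number of substitutions; otherwise it starts a new cluster.
    Each cluster is a list of read indices, representative first.
    """
    order = sorted(range(len(reads)), key=lambda i: -len(reads[i]))
    clusters: List[List[int]] = []
    for i in order:
        allowed = int(max_mismatch_frac * len(reads[i]))
        for cluster in clusters:
            representative = reads[cluster[0]]
            if mismatches(representative, reads[i]) <= allowed:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def duplicate_ratio(clusters: List[List[int]], n_reads: int) -> float:
    """Fraction of reads that are extra copies of a representative."""
    return sum(len(c) - 1 for c in clusters) / n_reads


def read_density(n_reads: int, genome_size: int) -> float:
    """Read density as defined in the abstract: reads per base of genome."""
    return n_reads / genome_size


if __name__ == "__main__":
    example_reads = ["ACGTACGTAC", "ACGTACG", "TTTTGGGGCC", "ACGTACGTAC"]
    groups = cluster_duplicates(example_reads)
    print(groups, duplicate_ratio(groups, len(example_reads)))
```

This sketch compares every read against every cluster representative, which is only practical for small inputs. A real dataset would need the faster filtering that the downloadable program provides.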

Figure 1: Ratio of all duplicates and average natural duplicates to all reads from genome projects. The x-axis shows the project identifiers of the datasets, ordered by decreasing read density (number of reads divided by genome size). The y-axis shows the ratio of duplicated reads to all reads.

Mentions: Each simulated read set has exactly the same number of sequences, with exactly the same lengths, as the experimental raw read set of the corresponding genome project. We generated 1000 sets of simulated reads for each genome project and selected the 100 sets most similar in GC content to the corresponding raw reads. We then introduced sequencing errors (insertions, deletions, and substitutions) into the simulated reads according to the error model derived in this study (Tables 2 and 3). These steps made the simulated read sets as similar as possible to the real read sets, except that the simulated sets contain only natural duplicates. Using cdhit-454, we identified the duplicates in the 100 simulated read sets of each project and calculated the average duplicate ratio and its standard deviation. Figure 1 shows the ratio of all duplicates and the average ratio of natural duplicates for these 39 projects. The results and standard deviations are also available in Table 1.
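
The simulation described above can be approximated with a short sketch. It is a simplified illustration, not the procedure used in the study. It assumes a linear genome, uniformly random start positions and strands, and it counts a read as a natural duplicate when another read shares the same start position and strand. GC-content matching and the sequencing error model are omitted. All function names are hypothetical.

```python
import random
import statistics
from collections import Counter
from typing import List, Optional, Tuple


def simulate_natural_duplicate_ratio(genome_size: int,
                                     read_lengths: List[int],
                                     seed: Optional[int] = None) -> float:
    """Sample one simulated read set and return its natural duplicate ratio.

    One read is drawn per entry of read_lengths, so the simulated set has
    the same number of reads and the same lengths as the real set.
    """
    rng = random.Random(seed)
    start_counts: Counter = Counter()
    for length in read_lengths:
        if length > genome_size:
            raise ValueError("read longer than genome")
        # Uniform start position such that the read fits in the genome.
        position = rng.randrange(genome_size - length + 1)
        strand = rng.choice("+-")
        start_counts[(position, strand)] += 1
    # Every read beyond the first at a given start counts as a duplicate.
    duplicates = sum(count - 1 for count in start_counts.values())
    return duplicates / len(read_lengths)


def mean_and_std(values: List[float]) -> Tuple[float, float]:
    """Average ratio and population standard deviation over simulated sets."""
    return statistics.mean(values), statistics.pstdev(values)


if __name__ == "__main__":
    lengths = [250] * 20000  # hypothetical read lengths
    ratios = [simulate_natural_duplicate_ratio(2_000_000, lengths, seed=s)
              for s in range(100)]
    print(mean_and_std(ratios))
```

Under these assumptions the ratio rises as the same number of reads is drawn from a smaller genome, which is consistent with the reported correlation between natural duplicates and read density.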

