Applying Shannon's information theory to bacterial and phage genomes and metagenomes.

Akhter S, Bailey BA, Salamon P, Aziz RK, Edwards RA - Sci Rep (2013)

Bottom Line: The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database.


Affiliation: Computational Science Research Center, San Diego State University, San Diego, CA, USA. sakhter@sciences.sdsu.edu

ABSTRACT
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapidly screening for sequences with matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.

Figure 5: (a) Cumulative comparison of the uncertainty (for word length 1) in DNA sequences in metagenome samples. Eight samples representative of the 24 used in this study are shown here: Soudan Mine Black Stuff (pink; ref. 34), Line Islands Kingman Reef phage (light green; ref. 35), Line Islands Tabuaeran phage (light blue; ref. 35), marine phages from the Gulf of Mexico (blue; ref. 29), marine samples supplemented with DMSP (magenta; ref. 36), Line Islands Palmyra phage (dark green; ref. 35), Line Islands Christmas Reef phage (red; ref. 35), and marine samples supplemented with vanillate (green; ref. 36). The uncertainty is greater than 1.7 for 85% to 90% of the sequences in all samples. (b) Comparison of Shannon's uncertainty and the observed similarity to known sequences. Shannon's uncertainty (H) was calculated for word length one and is compared with similarity to the SEED non-redundant protein database. Samples are coloured the same as in Fig. 5. Word lengths up to 11 letters were also used to calculate H, and all cases gave the same results (data not shown).
© Copyright Policy - open-access


Mentions: Shannon's uncertainty was calculated for different metagenomic data sets. The maximum uncertainty corresponds to a sequence in which every word occurs with equal frequency (e.g. A, G, C, and T for word length one). The majority of reads in a metagenome have an uncertainty greater than 1.8 per nt (Figure 5a), suggesting an even distribution of bases in the reads, although the relative information content of the reads varies by sample.
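
The calculation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `shannon_uncertainty` and `fraction_above` are hypothetical, and the sketch assumes base-2 logarithms (so that the maximum for word length one is 2 bits per nt) and overlapping words, neither of which is stated on this page.

```python
from collections import Counter
from math import log2


def shannon_uncertainty(seq, word_length=1):
    """Shannon uncertainty H, in bits, over the words of a sequence.

    Assumptions (not confirmed by the source): base-2 logarithms and
    overlapping words of the given length.
    """
    seq = seq.upper()
    # Slide a window of word_length over the sequence to collect words.
    words = [seq[i:i + word_length] for i in range(len(seq) - word_length + 1)]
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    # H = sum over words of p * log2(1 / p), where p is the word frequency.
    return sum((n / total) * log2(total / n) for n in counts.values())


def fraction_above(reads, threshold, word_length=1):
    """Fraction of reads whose uncertainty exceeds the threshold."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for r in reads if shannon_uncertainty(r, word_length) > threshold)
    return hits / len(reads)
```

With equal frequencies of A, C, G, and T, as in `shannon_uncertainty("ACGT")`, the function returns the maximum of 2 bits for word length one, while a homopolymer such as `"AAAA"` gives 0. A call such as `fraction_above(reads, 1.8)` would give the share of reads above the 1.8 per nt level mentioned in the text, assuming `reads` is an iterable of DNA strings.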

