Applying Shannon's information theory to bacterial and phage genomes and metagenomes.

Akhter S, Bailey BA, Salamon P, Aziz RK, Edwards RA - Sci Rep (2013)

Bottom Line: The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database.

Affiliation: Computational Science Research Center, San Diego State University, San Diego, CA, USA. sakhter@sciences.sdsu.edu

ABSTRACT
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapidly screening for sequences with matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.

Figure 3: Shannon's index vs. length for 600 complete phage genomes using word lengths 9 and 12.

Mentions: Shannon's indices for all phage genomes were plotted against genome length. For a word length of 12 nt, Shannon's index correlates strongly with the logarithm of the genome length (Figure 3). For a word length of 9 nt, the correlation is still significant; for shorter word lengths, however, no significant correlation between genome length and Shannon's index was observed. With short words, most genomes contain almost all possible words, so Shannon's index is nearly independent of genome length. With longer words, in contrast, larger genomes contain more distinct words than smaller genomes, so Shannon's index increases with genome length.
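The saturation argument above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes overlapping words on a single strand, base-2 logarithms, and uniformly random sequences standing in for genomes, and the function names `shannon_index` and `upper_bound` are hypothetical. The index can never exceed the logarithm of either the number of possible words (4^k) or the number of words actually read from the sequence, so for small k the first limit is reached by every genome, while for k = 12 the second limit (which grows with the logarithm of the length) dominates.

```python
import math
import random
from collections import Counter


def shannon_index(seq: str, k: int) -> float:
    """Shannon's index (in bits) over overlapping words of length k.

    Assumption: words are counted on one strand with a sliding window of step 1.
    """
    n_words = len(seq) - k + 1
    if n_words <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + k] for i in range(n_words))
    return -sum((c / n_words) * math.log2(c / n_words) for c in counts.values())


def upper_bound(length: int, k: int) -> float:
    """Largest possible index: limited by 4**k possible words and by the
    number of windows (length - k + 1) the sequence provides."""
    return math.log2(min(4 ** k, length - k + 1))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for length in (5_000, 50_000, 200_000):
        # Random sequence used purely as a stand-in for a genome.
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        for k in (3, 9, 12):
            h = shannon_index(seq, k)
            print(f"length={length:>7} k={k:>2} "
                  f"H={h:.2f} bound={upper_bound(length, k):.2f}")
```

For k = 3 the bound is 6 bits regardless of length, whereas for k = 12 it is set by the sequence length for any genome shorter than about 16.8 million nt, which is consistent with the length dependence reported for longer words. Real genomes are not random, so their values will generally fall below these bounds.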

