Applying Shannon's information theory to bacterial and phage genomes and metagenomes.

Akhter S, Bailey BA, Salamon P, Aziz RK, Edwards RA - Sci Rep (2013)

Bottom Line: The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database.


Affiliation: Computational Science Research Center, San Diego State University, San Diego, CA, USA. sakhter@sciences.sdsu.edu

ABSTRACT
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapidly screening for sequences with matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.


Figure 2: Length distribution of 600 complete phages.

Mentions: Word length is reportedly an important factor influencing the value of Shannon's index [6,7]. A high Shannon's index (close to the maximum possible index; for word length n, the maximum index is 2n) depends on the presence of all possible words in the genome. Consequently, the longer the genome, the higher the probability that it contains many different words. For a given word length n, there are 4^n possible words for DNA sequences. The length of most phage genomes (585 out of 600) ranges from 4^7 bp to less than 4^9 bp, i.e., from 16,384 bp to less than 262,144 bp (Figure 2). Therefore, for word sizes greater than 8 nt, many words will occur only once or not at all, which results in a lower Shannon's index for most of these genomes. In contrast, the average length of the 94 bacterial genomes used in this analysis is about 3 million bp (between 2×4^10 and 4^11 bp). Therefore, bacterial genomes have a higher Shannon's index than phage genomes using word lengths smaller than 12 nt. For word lengths of 11 nt or 12 nt, Shannon's index can distinguish phage and bacterial genomes (Figure 1), although this is likely because phage genomes are too short to generate sufficiently high Shannon's indices for words of this size.
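
The length argument above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that Shannon's index is the base-2 entropy of the frequencies of overlapping words of length n, which is consistent with the stated maximum of 2n (log2 of 4^n). The function names shannon_index and length_bound are hypothetical, and the random sequence is only a stand-in for a short, phage-sized genome.

```python
import math
import random
from collections import Counter


def shannon_index(seq: str, n: int) -> float:
    """Base-2 Shannon entropy of the overlapping n-letter words in seq.

    Assumption: words are counted with a sliding window of step 1.
    """
    windows = len(seq) - n + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + n] for i in range(windows))
    return -sum((c / windows) * math.log2(c / windows) for c in counts.values())


def length_bound(seq_len: int, n: int) -> float:
    """Upper bound on the index for a sequence of seq_len letters.

    The index cannot exceed 2n bits (all 4^n DNA words equally frequent),
    nor log2 of the number of windows (every window a distinct word).
    """
    return min(2.0 * n, math.log2(seq_len - n + 1))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Random stand-in sequence of 4^7 letters, the short end of the phage range.
    genome = "".join(rng.choice("ACGT") for _ in range(4 ** 7))
    for n in (4, 8, 12):
        h = shannon_index(genome, n)
        print(f"n={n:2d}  index={h:6.3f}  2n={2 * n:2d}  "
              f"bound={length_bound(len(genome), n):6.3f}")
```

For small n the index of the random sequence sits close to 2n, whereas for n = 12 it is capped near log2 of the sequence length, far below 24. This is the same effect that keeps the index of short phage genomes low at long word sizes.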

