Applying Shannon's information theory to bacterial and phage genomes and metagenomes.

Akhter S, Bailey BA, Salamon P, Aziz RK, Edwards RA - Sci Rep (2013)

Bottom Line: The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database.

Affiliation: Computational Science Research Center, San Diego State University, San Diego, CA, USA. sakhter@sciences.sdsu.edu

ABSTRACT
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapidly screening for sequences with matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.

Figure 4: (a) Shannon's index vs. GC% for 600 complete phage genomes using word lengths of 1 nt to 7 nt. (b) The relationship between Shannon's index and |GC% − 0.5| for 600 complete phage genomes using word lengths of 1 nt to 7 nt.

Mentions: For most phage genomes, the maximum word length that should be used to calculate Shannon's index (Equation 1) is 7 nt. When word lengths from 1 to 7 nt were used to calculate Shannon's index, GC-rich and GC-poor genomes were found to have a lower Shannon's index, since these genomes tend to have less diverse word combinations than genomes with 50% GC content (Figure 4a). The strong relationship between Shannon's index and |GC% − 0.5| for word lengths of 1 to 5 nt suggests that Shannon's index is strongly influenced by the GC composition of the DNA sequence (Figure 4b). For word lengths above 6 nt, the relationship is not strongly supported. Different sequences may have the same GC%, but Shannon's index depends on the distribution of the different word combinations. Therefore, two different sequences having the same GC% may have different Shannon's indices, and the probability of this happening increases with the word length. Thus, as word length is increased, the correlation between Shannon's index and GC content becomes weaker (Figure 4b).
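
The dependence on word length and GC content described above can be illustrated with a small sketch. Equation 1 is not reproduced on this page, so the code assumes the standard form of Shannon's index, H = −Σ p_i log2(p_i), where p_i is the relative frequency of the i-th word among all overlapping words of length k. The function names `shannon_index` and `gc_fraction` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_index(seq: str, k: int) -> float:
    """Shannon's index (bits) over overlapping words of length k.

    Assumes H = -sum(p_i * log2(p_i)); words containing characters
    other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in words if set(w) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Toy sequences (not real genomes), chosen only to make the effect visible.
balanced = "ACGT" * 50        # 50% GC
gc_rich = "GGGC" * 50         # 100% GC
shuffled = "AACCGGTT" * 25    # 50% GC, different word distribution

for name, s in [("balanced", balanced), ("gc_rich", gc_rich), ("shuffled", shuffled)]:
    gc = gc_fraction(s)
    print(name, f"GC={gc:.2f}", f"|GC-0.5|={abs(gc - 0.5):.2f}",
          [round(shannon_index(s, k), 3) for k in (1, 2, 3)])
```

In this toy setting, the GC-rich sequence has a lower index at word length 1 than the balanced one, while the two sequences with identical GC content share the same index at word length 1 but diverge at word length 2, which mirrors the weakening correlation at longer words.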

