The language of gene ontology: a Zipf's law analysis.

Kalankesh LR, Stevens R, Brass A - BMC Bioinformatics (2012)

Bottom Line: Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.


Affiliation: School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK.

ABSTRACT

Background: Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language.

Results: Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotations captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation.

Conclusions: Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.


Figure 1: The cumulative distribution function Pr(x) plotted as a function of frequency (x) for GO gene annotations contained within Human GOA. The straight line shows the region of each plot for which a power law was found to provide a good model of the data [25]. 1(a) Annotation from the biological process sub-ontology, 1(b) annotation from the molecular function sub-ontology, and 1(c) annotation from the cellular component sub-ontology. The measured power law exponents, β, were 2.04, 1.83, and 1.73 respectively. For all graphs the p-value was > 0.55, suggesting that the power law provides a plausible model of the data.
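
For readers comparing these exponents with the rank-frequency form of Zipf's law, the standard relations between the two are sketched below. This assumes that β is the exponent of the frequency distribution p(x), which the caption does not state explicitly; α is a symbol introduced here for the rank-frequency exponent and does not appear in the source. The relations hold for β > 1.

```latex
p(x) \propto x^{-\beta}
\quad\Longrightarrow\quad
\Pr(X \ge x) \propto x^{-(\beta - 1)},
\qquad
f(r) \propto r^{-\alpha}
\quad\text{with}\quad
\alpha = \frac{1}{\beta - 1}
```

Under that assumption, the reported values β = 2.04, 1.83, and 1.73 would correspond to rank-frequency exponents α of roughly 0.96, 1.20, and 1.37.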

Mentions: Figure 1 shows the log-log plots of cumulative frequency vs. term rank (Pareto plots) for data from the human GOA. These plots show strong support for a power law model of the annotations from all three sub-ontologies, as demonstrated by the p-values returned from the fitting software.
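
The kind of analysis described above can be illustrated with a minimal Python sketch. It is not the fitting software used in the paper, which is not identified in this excerpt. It counts how often each GO term occurs, builds the empirical cumulative distribution Pr(X ≥ x), and estimates a power law exponent with the standard maximum-likelihood formula (including the usual 0.5 shift approximation for discrete data). The function names, the `x_min` cutoff, and the toy annotation list are all assumptions made for illustration, and the sketch does not compute a goodness-of-fit p-value.

```python
import math
from collections import Counter


def term_frequencies(annotations):
    """Count how often each GO term ID occurs in a list of annotations."""
    return Counter(annotations)


def empirical_ccdf(values):
    """Return sorted (x, Pr(X >= x)) pairs for the observed values."""
    n = len(values)
    counts = Counter(values)
    remaining = n
    ccdf = []
    for x in sorted(counts):
        ccdf.append((x, remaining / n))
        remaining -= counts[x]
    return ccdf


def estimate_exponent(values, x_min, discrete=True):
    """Maximum-likelihood estimate of beta for p(x) ~ x**(-beta), x >= x_min.

    For discrete data the common approximation replaces x_min by
    x_min - 0.5 inside the logarithm; x_min is chosen by the caller here
    (a hypothetical parameter, not a value taken from the paper).
    """
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    shift = 0.5 if discrete else 0.0
    log_sum = sum(math.log(x / (x_min - shift)) for x in tail)
    if log_sum == 0:
        raise ValueError("all observations equal x_min; exponent undefined")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Toy annotation list for illustration only; not data from Human GOA.
    annotations = (
        ["GO:0005515"] * 8 + ["GO:0005634"] * 4 + ["GO:0005737"] * 2
        + ["GO:0016020", "GO:0005829", "GO:0003677"]
    )
    freqs = list(term_frequencies(annotations).values())
    for x, p in empirical_ccdf(freqs):
        print(f"x={x}  Pr(X>=x)={p:.3f}")
    print("estimated beta:", estimate_exponent(freqs, x_min=1))
```

Plotting the `(x, Pr(X >= x))` pairs on log-log axes gives a curve of the kind shown in Figure 1. With a real corpus, the lower cutoff `x_min` and the p-value would normally be obtained from a dedicated power law fitting package rather than set by hand.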

