Mining phenotypes for gene function prediction.

Groth P, Weiss B, Pohlenz HD, Leser U - BMC Bioinformatics (2008)

Bottom Line: We present results of a study in which we use a large set of phenotype data - in textual form - to predict gene annotation. We manually verified some of these clusters and found them to exhibit high biological coherence, e.g. a group containing all available antennal Drosophila odorant receptors despite inconsistent GO-annotations. We show that text clustering can play an important role in this process.


Affiliation: Research Laboratories of Bayer Schering Pharma AG, Berlin, Germany. groth@informatik.hu-berlin.de

ABSTRACT

Background: Health and disease of organisms are reflected in their phenotypes. Often, a genetic component to a disease is discovered only after clearly defining its phenotype. In recent years, many technologies to systematically generate phenotypes in a high-throughput manner, such as RNA interference or gene knock-out, have been developed and used to decipher the functions of genes. However, there have been relatively few efforts to make use of phenotype data beyond single genotype-phenotype relationships.

Results: We present results of a study in which we use a large set of phenotype data - in textual form - to predict gene annotation. To this end, we use text clustering to group genes based on their phenotype descriptions. We show that these clusters correlate well with several indicators of biological coherence in gene groups, such as functional annotations from the Gene Ontology (GO) and protein-protein interactions. We exploit these clusters for predicting gene function by carrying over annotations from well-annotated genes to other, less-characterized genes in the same cluster. For a subset of groups selected by applying objective criteria, we can predict GO-term annotations from the biological process sub-ontology with up to 72.6% precision and 16.7% recall, as evaluated by cross-validation. We manually verified some of these clusters and found them to exhibit high biological coherence, e.g. a group containing all available antennal Drosophila odorant receptors despite inconsistent GO-annotations.

Conclusion: The intrinsic nature of phenotypes to visibly reflect genetic activity underlines their usefulness in inferring new gene functions. Thus, systematically analyzing these data on a large scale offers many possibilities for inferring functional annotation of genes. We show that text clustering can play an important role in this process.
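
The annotation transfer and its evaluation, as summarized in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation: the majority-style rule controlled by `min_fraction`, the leave-one-out form of cross-validation, the function names, and the toy gene and term labels (`g1`, `GO:A`, ...) are all assumptions made for illustration. Each gene is held out in turn, terms are predicted from the other genes in its cluster, and precision and recall are computed over all held-out genes.

```python
from collections import Counter

def predict_terms(cluster, held_out, annotations, min_fraction=0.5):
    """Predict terms for `held_out` from the other genes in its cluster.

    Assumption: a term is carried over when at least `min_fraction`
    of the remaining cluster members are annotated with it.
    """
    others = [g for g in cluster if g != held_out]
    if not others:
        return set()
    counts = Counter(t for g in others for t in annotations.get(g, set()))
    return {t for t, n in counts.items() if n / len(others) >= min_fraction}

def leave_one_out(clusters, annotations, min_fraction=0.5):
    """Return (precision, recall) over all genes, each held out once."""
    tp = fp = fn = 0
    for cluster in clusters:
        for gene in cluster:
            truth = annotations.get(gene, set())
            predicted = predict_terms(cluster, gene, annotations, min_fraction)
            tp += len(predicted & truth)   # correctly predicted terms
            fp += len(predicted - truth)   # predicted but not annotated
            fn += len(truth - predicted)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data: two phenotype clusters and placeholder term labels.
clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
annotations = {
    "g1": {"GO:A", "GO:B"},
    "g2": {"GO:A"},
    "g3": {"GO:A", "GO:C"},
    "g4": {"GO:D"},
    "g5": {"GO:E"},
}

precision, recall = leave_one_out(clusters, annotations)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Raising `min_fraction` makes the transfer more conservative, which would typically trade recall for precision, the same trade-off visible in the reported 72.6% precision at 16.7% recall.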


Figure 2: Cross-species phenotype data distribution. The left pie chart depicts the distribution of genes by species, i.e. the relative number of genes in our gene set according to species affiliation. The right pie chart shows the distribution of clusters according to single species or 'mixed', if the cluster is made up of genes from more than one species.

Mentions: Of the 1,000 clusters, 90.4% contain genes from a single species. Figure 1 shows the distribution of clusters into different sizes. Figure 2 details the distribution of genes by species (independent of the clustering) and the distribution of species in clusters (dependent on the clustering).
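
The two distributions shown in Figure 2 can be computed as in the following sketch. The gene identifiers, species labels, and function names are hypothetical; only the definitions (genes counted by species, clusters labelled by their single species or as 'mixed') come from the caption.

```python
from collections import Counter

def cluster_species_label(cluster, species_of):
    """Return the species of a cluster, or 'mixed' if it spans several."""
    species = {species_of[g] for g in cluster}
    return next(iter(species)) if len(species) == 1 else "mixed"

def cluster_distribution(clusters, species_of):
    """Count clusters per label (right pie chart in Figure 2)."""
    return Counter(cluster_species_label(c, species_of) for c in clusters)

def single_species_fraction(clusters, species_of):
    """Fraction of clusters whose genes all come from one species."""
    dist = cluster_distribution(clusters, species_of)
    total = sum(dist.values())
    return (total - dist["mixed"]) / total if total else 0.0

# Hypothetical toy data: six genes from three species in three clusters.
species_of = {
    "g1": "fly", "g2": "fly", "g3": "mouse",
    "g4": "mouse", "g5": "worm", "g6": "fly",
}
clusters = [["g1", "g2"], ["g3", "g4"], ["g5", "g6"]]

# Genes by species, independent of the clustering (left pie chart).
gene_distribution = Counter(species_of.values())
print(gene_distribution)
print(cluster_distribution(clusters, species_of))
print(single_species_fraction(clusters, species_of))
```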

