Exploiting ontology graph for predicting sparsely annotated gene function.
Bottom Line: There exist a variety of algorithms for function prediction, but none properly address this 'overfitting' issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog.Our method is scalable to datasets with a large number of annotations.In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information.
Affiliation: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA.Show MeSH
Related in: MedlinePlus
Mentions: As an overview, clusDCA first computes the ‘diffusion state’ of each node by performing a RWR on each input network, and subsequently finds a low-dimensional vector representation for each gene via an efficient matrix factorization of the diffusion states. A key contribution is that clusDCA then follows an analogous procedure to obtain a low-dimensional vector representation of each functional label based on the ontology graph. Intuitively, the gene vectors encode the topology of the interactome, which in turn reflects gene function, while the label vectors encode the topology of the ontology graph, which reflects the semantic and relational properties of the labels. Given both the gene and the label vectors, clusDCA novelly finds the best projection of the gene vectors onto the label vector space, thus keeping the projected gene vectors geometrically close to their known labels. In the final step, clusDCA computes its predictions for an uncharacterized gene by sorting the candidate functions by their proximity to the projected gene vector, based on the optimal projection. An illustration of this pipeline is given in Figure 2. We give a more detailed description of this pipeline below.Fig. 2.
Affiliation: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA.