Predicting protein functions using incomplete hierarchical labels.

Yu G, Zhu H, Domeniconi C - BMC Bioinformatics (2015)

Bottom Line: Protein function prediction aims to assign biological or biochemical functions to proteins, and it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.


Affiliation: Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China. guoxian85@gmail.com.

ABSTRACT

Background: Protein function prediction aims to assign biological or biochemical functions to proteins, and it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e., that no label is missing. In real scenarios, however, we may be aware of only some of the hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction.

Results: In this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL for short). PILL takes into account both the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels of a protein based on ComSim and the known labels of the protein, and uses regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels.

Conclusion: The empirical study shows that it is important to consider the incomplete annotation for protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.


Figure 1: Illustration of incomplete hierarchical labels for proteins annotated with MIPS FunCat labels. A rectangle represents a protein (pi, i∈{1,2,3}); an ellipse denotes a function label, and an undirected line between rectangles captures a protein-protein interaction (the more reliable the interaction, the thicker the line). All the functional labels in the ellipses (including the missing function labels, denoted by colored ellipses with question marks ‘?’) should be associated with the proteins, but only the functional labels in the white ellipses are known. For better visualization, other functional labels (i.e., ‘12.03’ and ‘03’, which are not ground-truth labels for these proteins) are not plotted in the figure.

Figure 1 illustrates an example of an incomplete hierarchical label problem for proteins annotated with FunCat labels. A corresponding example for the GO labels is given in Figure S1 of Additional file 1. In Figure 1, p1 and p2 are partially labeled (missing labels are marked with a question mark, ‘?’), and p3 is completely unlabeled. Note that other FunCat labels (i.e., ‘12.03’ and ‘03’) are not really missing for these proteins and are thus not shown in the figure; these function labels will nevertheless also be viewed as candidate ‘missing’ labels. The missing labels are leaf function labels. If a non-leaf function label of a protein is missing, we can recover it directly from the protein's descendant function labels and append it to the protein. Whether a function label is a leaf or a non-leaf is defined with respect to a single protein. For example, ‘12.04’ is a leaf function label for p2, but it is a non-leaf function label for p1, since p1 is labeled with a descendant label (‘12.04.02’) of ‘12.04’. Our task is to replenish the missing labels of p1 and p2, and to predict functions for p3. To this end, we define three kinds of relationships between function labels: (i) parent-child (e.g., ‘01.03’ is a child function label of ‘01’); (ii) grandparent-grandson (e.g., ‘01.03.02’ is a grandson label of ‘01’); and (iii) uncle-nephew (e.g., if we consider ‘01’ as a sibling of ‘12’, although these two labels do not have an explicit common parent label, ‘12’ is an uncle label of ‘01.03’). These relationships will be further discussed in the next Section.
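
The label relationships and the per-protein notion of a leaf label described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the PILL implementation: it assumes FunCat labels are dot-separated strings whose parent is obtained by dropping the last component, and it treats all top-level labels as siblings under an implicit root. The function names and the two small label sets are hypothetical and contain only labels named in the text, not the full annotations shown in Figure 1.

```python
from typing import List, Optional, Set


def parent(label: str) -> Optional[str]:
    """Parent of a FunCat label, or None for a top-level label such as '01'."""
    head, sep, _ = label.rpartition(".")
    return head if sep else None


def ancestors(label: str) -> List[str]:
    """All ancestors of a label, nearest first."""
    found = []
    current = parent(label)
    while current is not None:
        found.append(current)
        current = parent(current)
    return found


def is_child(child: str, par: str) -> bool:
    """(i) parent-child: `child` sits directly below `par`."""
    return parent(child) == par


def is_grandson(grandson: str, grandparent: str) -> bool:
    """(ii) grandparent-grandson: two levels apart on the same path."""
    mid = parent(grandson)
    return mid is not None and parent(mid) == grandparent


def is_uncle(uncle: str, nephew: str) -> bool:
    """(iii) uncle-nephew: `uncle` is a sibling of the nephew's parent.

    Assumption: top-level labels share an implicit root, so they are siblings.
    """
    mid = parent(nephew)
    if mid is None:
        return False
    return uncle != mid and parent(uncle) == parent(mid)


def with_ancestors(labels: Set[str]) -> Set[str]:
    """Append every missing non-leaf label implied by the known labels."""
    return set(labels) | {a for label in labels for a in ancestors(label)}


def leaf_labels(labels: Set[str]) -> Set[str]:
    """Labels that have no child among this protein's own labels."""
    full = with_ancestors(labels)
    return {label for label in full if not any(parent(o) == label for o in full)}


# Hypothetical partial annotations, using only labels mentioned in the text.
p1 = {"12.04.02"}
p2 = {"12.04"}

print(is_child("01.03", "01"))        # relationship (i)
print(is_grandson("01.03.02", "01"))  # relationship (ii)
print(is_uncle("12", "01.03"))        # relationship (iii)
print(sorted(with_ancestors(p1)))     # non-leaf labels recovered for p1
print(sorted(leaf_labels(p1)), sorted(leaf_labels(p2)))
```

Because leaves are computed from each protein's own label set, ‘12.04’ comes out as a leaf for the hypothetical p2 but as a non-leaf for p1, which mirrors the example in the text.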

