Limits...
Homology-based inference sets the bar high for protein function prediction.

Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B - BMC Bioinformatics (2013)

Bottom Line: Firstly, our most successful implementation for the baseline ranked very high at CAFA1.It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users.Clearly, the definition of proper goals remains one major objective for CAFA.

View Article: PubMed Central - HTML - PubMed

Affiliation: TUM, Department of Informatics, Bioinformatics & Computational Biology - I12 Boltzmannstr, 3, 85748 Garching/Munich, Germany.

ABSTRACT

Background: Any method that de novo predicts protein function should do better than random. More challenging, it also ought to outperform simple homology-based inference.

Methods: Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements.

Results and conclusions: During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA.

Show MeSH

Related in: MedlinePlus

Flow chart of StudentB. StudentB first logarithmizes the E-Values of all BLAST hits and averages them. The result is mapped into a range from 0 to 1 by looking up its percentile in a precompiled distribution. This percentile is the Template Quality Score and reflects how well we can predict the entire target. To score single terms, we multiply it with the score of each predicted leaf, i.e. the Combined Leaf Score. This is the average of the logarithmized E-Values of the nodes on the path from the leaf to the root. The propagation of the leaf terms and their scores is output to the user.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3584931&req=5

Figure 3: Flow chart of StudentB. StudentB first logarithmizes the E-Values of all BLAST hits and averages them. The result is mapped into a range from 0 to 1 by looking up its percentile in a precompiled distribution. This percentile is the Template Quality Score and reflects how well we can predict the entire target. To score single terms, we multiply it with the score of each predicted leaf, i.e. the Combined Leaf Score. This is the average of the logarithmized E-Values of the nodes on the path from the leaf to the root. The propagation of the leaf terms and their scores is output to the user.

Mentions: Next, cluster all branches so that each pair from two different clusters overlaps less than 10%. For each cluster, the branch with the longest path to the root is chosen, reduced to its original leaf term with the original score and output to the user. As the redundancy reduction may filter out highly supported terms, we apply a final correction: if any pair of branches from previous steps overlaps over 90%, the term common to both and with the longest path to the root, i.e., the lowest common ancestor, is added to the result.


Homology-based inference sets the bar high for protein function prediction.

Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B - BMC Bioinformatics (2013)

Flow chart of StudentB. StudentB first logarithmizes the E-Values of all BLAST hits and averages them. The result is mapped into a range from 0 to 1 by looking up its percentile in a precompiled distribution. This percentile is the Template Quality Score and reflects how well we can predict the entire target. To score single terms, we multiply it with the score of each predicted leaf, i.e. the Combined Leaf Score. This is the average of the logarithmized E-Values of the nodes on the path from the leaf to the root. The propagation of the leaf terms and their scores is output to the user.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3584931&req=5

Figure 3: Flow chart of StudentB. StudentB first logarithmizes the E-Values of all BLAST hits and averages them. The result is mapped into a range from 0 to 1 by looking up its percentile in a precompiled distribution. This percentile is the Template Quality Score and reflects how well we can predict the entire target. To score single terms, we multiply it with the score of each predicted leaf, i.e. the Combined Leaf Score. This is the average of the logarithmized E-Values of the nodes on the path from the leaf to the root. The propagation of the leaf terms and their scores is output to the user.
Mentions: Next, cluster all branches so that each pair from two different clusters overlaps less than 10%. For each cluster, the branch with the longest path to the root is chosen, reduced to its original leaf term with the original score and output to the user. As the redundancy reduction may filter out highly supported terms, we apply a final correction: if any pair of branches from previous steps overlaps over 90%, the term common to both and with the longest path to the root, i.e., the lowest common ancestor, is added to the result.

Bottom Line: Firstly, our most successful implementation for the baseline ranked very high at CAFA1.It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users.Clearly, the definition of proper goals remains one major objective for CAFA.

View Article: PubMed Central - HTML - PubMed

Affiliation: TUM, Department of Informatics, Bioinformatics & Computational Biology - I12 Boltzmannstr, 3, 85748 Garching/Munich, Germany.

ABSTRACT

Background: Any method that de novo predicts protein function should do better than random. More challenging, it also ought to outperform simple homology-based inference.

Methods: Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements.

Results and conclusions: During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA.

Show MeSH
Related in: MedlinePlus