Limits...
Modifying the DPClus algorithm for identifying protein complexes based on new topological structures.

Li M, Chen JE, Wang JX, Hu B, Chen G - BMC Bioinformatics (2008)

Bottom Line: Identification of protein complexes is crucial for understanding principles of cellular organization and functions.As the size of protein-protein interaction set increases, a general trend is to represent the interactions as a network and to develop effective algorithms to detect significant complexes in such networks.Based on the study of known complexes in protein networks, this paper proposes a new topological structure for protein complexes, which is a combination of subgraph diameter (or average vertex distance) and subgraph density.

View Article: PubMed Central - HTML - PubMed

Affiliation: School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, PR China. limin@mail.csu.edu.cn

ABSTRACT

Background: Identification of protein complexes is crucial for understanding principles of cellular organization and functions. As the size of protein-protein interaction set increases, a general trend is to represent the interactions as a network and to develop effective algorithms to detect significant complexes in such networks.

Results: Based on the study of known complexes in protein networks, this paper proposes a new topological structure for protein complexes, which is a combination of subgraph diameter (or average vertex distance) and subgraph density. Following the approach of that of the previously proposed clustering algorithm DPClus which expands clusters starting from seeded vertices, we present a clustering algorithm IPCA based on the new topological structure for identifying complexes in large protein interaction networks. The algorithm IPCA is applied to the protein interaction network of Sacchromyces cerevisiae and identifies many well known complexes. Experimental results show that the algorithm IPCA recalls more known complexes than previously proposed clustering algorithms, including DPClus, CFinder, LCMA, MCODE, RNSC and STM.

Conclusion: The proposed algorithm based on the new topological structure makes it possible to identify dense subgraphs in protein interaction networks, many of which correspond to known protein complexes. The algorithm is robust to the known high rate of false positives and false negatives in data from high-throughout interaction techniques. The program is available at http://netlab.csu.edu.cn/bioinformatics/limin/IPCA.

Show MeSH
Comparison of the predicted clusters with the known complexes. The number of matched known complexes with respect to different overlapping scores for different sets generated by IPCA using different parameters.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2570695&req=5

Figure 4: Comparison of the predicted clusters with the known complexes. The number of matched known complexes with respect to different overlapping scores for different sets generated by IPCA using different parameters.

Mentions: where /VPc ∩ VKc/ is the size of the intersection set of the predicted cluster and the known complex, /VPc/ is the size of the predicted cluster and /VKc/ is the size of the known complex. A known complex Kc that has no proteins in a predicted cluster Pc has OS(Pc, Kc) = 0 and a known complex Kc that perfectly matches a predicted cluster Pc has OS(Pc, Kc) = 1. A known complex and a predicted cluster are considered as a match if their overlapping score is equal to or larger than a specific threshold. The numbers of matched known complexes with respect to different overlapping score threshold (from 0 to 1 with a 0.1 increment) are shown in Figure 4. The best matching result is obtained when Tin = 0.9 for both SP ≤ 2 and ASP ≤ 2. There are 165 known complexes matched when the overlapping score threshold is 0.2. There are 28 known complexes matched perfectly. When Tin ≥ 0.5, the number of matched known complexes is almost the same for SP ≤ 2 and ASP ≤ 2. When Tin ≤ 0.5, the number of matched known complexes is larger for SP ≤ 2 than for ASP ≤ 2. The probability that a known complex is matched perfectly by a cluster in which proteins are picked up randomly is determined by the size of the network and the known complex. The probability that a known complex with size = 3 matches perfectly by a cluster selected randomly in the yeast network used in this paper is 6.39 * 10-11. It is very obvious that more known complexes matched by the predicted clusters implies that the algorithm is more effective to detect complexes. Sensitivity and specificity are two important aspects to estimate the performance of algorithms for detecting protein complexes. Sensitivity is the fraction of the true-positive predictions out of all the true predictions, defined by the following formula:


Modifying the DPClus algorithm for identifying protein complexes based on new topological structures.

Li M, Chen JE, Wang JX, Hu B, Chen G - BMC Bioinformatics (2008)

Comparison of the predicted clusters with the known complexes. The number of matched known complexes with respect to different overlapping scores for different sets generated by IPCA using different parameters.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2570695&req=5

Figure 4: Comparison of the predicted clusters with the known complexes. The number of matched known complexes with respect to different overlapping scores for different sets generated by IPCA using different parameters.
Mentions: where /VPc ∩ VKc/ is the size of the intersection set of the predicted cluster and the known complex, /VPc/ is the size of the predicted cluster and /VKc/ is the size of the known complex. A known complex Kc that has no proteins in a predicted cluster Pc has OS(Pc, Kc) = 0 and a known complex Kc that perfectly matches a predicted cluster Pc has OS(Pc, Kc) = 1. A known complex and a predicted cluster are considered as a match if their overlapping score is equal to or larger than a specific threshold. The numbers of matched known complexes with respect to different overlapping score threshold (from 0 to 1 with a 0.1 increment) are shown in Figure 4. The best matching result is obtained when Tin = 0.9 for both SP ≤ 2 and ASP ≤ 2. There are 165 known complexes matched when the overlapping score threshold is 0.2. There are 28 known complexes matched perfectly. When Tin ≥ 0.5, the number of matched known complexes is almost the same for SP ≤ 2 and ASP ≤ 2. When Tin ≤ 0.5, the number of matched known complexes is larger for SP ≤ 2 than for ASP ≤ 2. The probability that a known complex is matched perfectly by a cluster in which proteins are picked up randomly is determined by the size of the network and the known complex. The probability that a known complex with size = 3 matches perfectly by a cluster selected randomly in the yeast network used in this paper is 6.39 * 10-11. It is very obvious that more known complexes matched by the predicted clusters implies that the algorithm is more effective to detect complexes. Sensitivity and specificity are two important aspects to estimate the performance of algorithms for detecting protein complexes. Sensitivity is the fraction of the true-positive predictions out of all the true predictions, defined by the following formula:

Bottom Line: Identification of protein complexes is crucial for understanding principles of cellular organization and functions.As the size of protein-protein interaction set increases, a general trend is to represent the interactions as a network and to develop effective algorithms to detect significant complexes in such networks.Based on the study of known complexes in protein networks, this paper proposes a new topological structure for protein complexes, which is a combination of subgraph diameter (or average vertex distance) and subgraph density.

View Article: PubMed Central - HTML - PubMed

Affiliation: School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, PR China. limin@mail.csu.edu.cn

ABSTRACT

Background: Identification of protein complexes is crucial for understanding principles of cellular organization and functions. As the size of protein-protein interaction set increases, a general trend is to represent the interactions as a network and to develop effective algorithms to detect significant complexes in such networks.

Results: Based on the study of known complexes in protein networks, this paper proposes a new topological structure for protein complexes, which is a combination of subgraph diameter (or average vertex distance) and subgraph density. Following the approach of that of the previously proposed clustering algorithm DPClus which expands clusters starting from seeded vertices, we present a clustering algorithm IPCA based on the new topological structure for identifying complexes in large protein interaction networks. The algorithm IPCA is applied to the protein interaction network of Sacchromyces cerevisiae and identifies many well known complexes. Experimental results show that the algorithm IPCA recalls more known complexes than previously proposed clustering algorithms, including DPClus, CFinder, LCMA, MCODE, RNSC and STM.

Conclusion: The proposed algorithm based on the new topological structure makes it possible to identify dense subgraphs in protein interaction networks, many of which correspond to known protein complexes. The algorithm is robust to the known high rate of false positives and false negatives in data from high-throughout interaction techniques. The program is available at http://netlab.csu.edu.cn/bioinformatics/limin/IPCA.

Show MeSH