Relating diseases by integrating gene associations and information flow through protein interaction network.

Hamaneh MB, Yu YK - PLoS ONE (2014)

Bottom Line: We have also compared our results to those of MimMiner, a text-mining method that assigns pairwise similarity scores to diseases. We find the results of the two methods to be complementary. Although not needed for understanding this paper, the raw results are available for download for further study at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/DiseaseRelations/.


Affiliation: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America.

ABSTRACT
Identifying similar diseases could potentially provide deeper understanding of their underlying causes, and may even hint at possible treatments. For this purpose, it is necessary to have a similarity measure that reflects the underpinning molecular interactions and biological pathways. We have thus devised a network-based measure that can partially fulfill this goal. Our method assigns weights to all proteins (and consequently their encoding genes) by using information flow from a disease to the protein interaction network and back. Similarity between two diseases is then defined as the cosine of the angle between their corresponding weight vectors. The proposed method also provides a way to suggest disease-pathway associations by using the weights assigned to the genes to perform enrichment analysis for each disease. By calculating pairwise similarities between 2534 diseases, we show that our disease similarity measure is strongly correlated with the probability of finding the diseases in the same disease family and, more importantly, sharing biological pathways. We have also compared our results to those of MimMiner, a text-mining method that assigns pairwise similarity scores to diseases. We find the results of the two methods to be complementary. It is also shown that clustering diseases based on their similarities and performing enrichment analysis for the cluster centers significantly increases the term association rate, suggesting that the cluster centers are better representatives for biological pathways than the diseases themselves. This lends support to the view that our similarity measure is a good indicator of relatedness of biological processes involved in causing the diseases. Although not needed for understanding this paper, the raw results are available for download for further study at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/DiseaseRelations/.
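The similarity definition in the abstract (the cosine of the angle between two disease weight vectors) can be illustrated with a minimal sketch. The vectors below are hypothetical toy values over four genes, not weights produced by the information-flow method, and returning 0 for an all-zero vector is a convention assumed here rather than something stated in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("weight vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention for this sketch: an all-zero vector
        # is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical gene weights for three diseases over the same four genes.
disease_a = [0.5, 0.3, 0.0, 0.2]
disease_b = [0.4, 0.4, 0.1, 0.1]
disease_c = [0.0, 0.0, 1.0, 0.0]

print(cosine_similarity(disease_a, disease_b))  # overlapping weights, high score
print(cosine_similarity(disease_a, disease_c))  # no shared weighted genes
```

Because the weights are non-negative in this toy example, the score falls between 0 and 1, and two diseases that place weight on disjoint sets of genes score exactly 0.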


Figure 4 (pone-0110936-g004): The effect of clustering on the minimum term size. The minimum term size distribution of (A) GO and (B) KEGG terms reported by SaddleSum enrichment analyses when using disease weight vectors directly (red curves) and when using cluster center vectors (blue curves). Not only are the most informative (smallest) terms preserved during clustering, but the clustering procedure also seems to shift the minimum term size distribution towards the small end, suggesting that grouping weight vectors under the proposed clustering procedure is likely to yield even more specific terms.

Mentions: Our iterative approach resulted in 1707 clusters. Enrichment analysis was run for all cluster centers obtained in this stage and found significant hits for 1301 clusters, with an average of 70.9 GO and 7.5 KEGG terms per cluster, which was higher than the average number of terms found for the diseases. The probabilities of belonging to different clusters were calculated for each disease and were used to determine the percentage of diseases with term hits, defined by Eq. (7), which uses an indicator function taking the value 1 when a cluster has a term hit and 0 otherwise. Interestingly, the percentage of such diseases increased from 60% (when enrichment was performed directly for the diseases) to 85%. For the diseases that had term hits using both methods (direct and through clustering), the term similarity was calculated using Eq. (8), which involves the number of diseases that have significant term hits, the set of terms associated with each disease, the set of terms assigned to each cluster, and the number of members in a set. We found […]. This seems to indicate that more than […] of the terms associated with the diseases were dropped upon merging into clusters, and that some information might have been lost in the process. What really matters, however, is whether terms with a small number of annotated genes are preserved, as these terms are the most specific and usually the most informative. Upon examining the distribution of the minimum GO/KEGG term size (the number of annotated genes for that term) when running SaddleSum using diseases directly and using cluster centers, we find that the most informative terms are largely kept in the process. The distribution of the minimum term size is shown in Fig. 4.
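The two quantities described above can be sketched in code, with the caveat that Eqs. (7) and (8) are not reproduced in this text. The forms used below are therefore assumptions: the hit percentage is taken as the membership-weighted average of the cluster hit indicator, and the term similarity as the membership-weighted fraction of a disease's terms that also appear for its clusters. All function names, disease and cluster labels, and term identifiers are hypothetical.

```python
def percent_with_hits(membership, cluster_has_hit):
    """Assumed form of the hit percentage.

    membership[d][c] is the probability that disease d belongs to cluster c;
    cluster_has_hit[c] plays the role of the indicator function (True = 1).
    """
    total = 0.0
    for probs in membership.values():
        total += sum(p for c, p in probs.items() if cluster_has_hit.get(c, False))
    return 100.0 * total / len(membership)


def term_similarity(membership, disease_terms, cluster_terms):
    """Assumed form of the term similarity.

    Averages, over diseases with at least one term, the membership-weighted
    fraction of the disease's terms that are also assigned to its clusters.
    """
    scores = []
    for d, terms in disease_terms.items():
        if not terms:
            continue  # only diseases with significant term hits are counted
        kept = 0.0
        for c, p in membership[d].items():
            shared = terms & cluster_terms.get(c, set())
            kept += p * len(shared) / len(terms)
        scores.append(kept)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: two diseases, two clusters, hypothetical term identifiers.
membership = {"d1": {"c1": 1.0}, "d2": {"c1": 0.5, "c2": 0.5}}
cluster_has_hit = {"c1": True, "c2": False}
disease_terms = {"d1": {"t1", "t2"}, "d2": {"t1"}}
cluster_terms = {"c1": {"t1"}, "c2": set()}

print(percent_with_hits(membership, cluster_has_hit))           # 75.0
print(term_similarity(membership, disease_terms, cluster_terms))  # 0.5
```

Under these assumed definitions, a term similarity of 0.5 would mean that, on average, half of each disease's directly found terms survive in its clusters; the paper's actual normalization may differ.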

