Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes.

Himmelstein DS, Baranzini SE - PLoS Comput. Biol. (2015)

Bottom Line: The first decade of Genome Wide Association Studies (GWAS) has uncovered a wealth of disease-associated variants. The model, which achieved 132-fold enrichment in precision at 10% recall, outperformed any individual domain, highlighting the benefit of integrative approaches. Our method successfully predicted the results (with AUROC = 0.79) from a withheld multiple sclerosis (MS) GWAS despite starting with only 13 previously associated genes.


Affiliation: Biological & Medical Informatics, University of California, San Francisco, San Francisco, California, United States of America.

ABSTRACT
The first decade of Genome Wide Association Studies (GWAS) has uncovered a wealth of disease-associated variants. Two important derivations will be the translation of this information into a multiscale understanding of pathogenic variants and leveraging existing data to increase the power of existing and future studies through prioritization. We explore edge prediction on heterogeneous networks--graphs with multiple node and edge types--for accomplishing both tasks. First we constructed a network with 18 node types--genes, diseases, tissues, pathophysiologies, and 14 MSigDB (molecular signatures database) collections--and 19 edge types from high-throughput publicly-available resources. From this network composed of 40,343 nodes and 1,608,168 edges, we extracted features that describe the topology between specific genes and diseases. Next, we trained a model from GWAS associations and predicted the probability of association between each protein-coding gene and each of 29 well-studied complex diseases. The model, which achieved 132-fold enrichment in precision at 10% recall, outperformed any individual domain, highlighting the benefit of integrative approaches. We identified pleiotropy, transcriptional signatures of perturbations, pathways, and protein interactions as influential mechanisms explaining pathogenesis. Our method successfully predicted the results (with AUROC = 0.79) from a withheld multiple sclerosis (MS) GWAS despite starting with only 13 previously associated genes. Finally, we combined our network predictions with statistical evidence of association to propose four novel MS genes, three of which (JAK2, REL, RUNX3) validated on the masked GWAS. Furthermore, our predictions provide biological support highlighting REL as the causal gene within its gene-rich locus. Users can browse all predictions online (http://het.io). 
Heterogeneous network edge prediction effectively prioritized genetic associations and provides a powerful new approach for data integration across multiple domains.


Fig 2 (pcbi.1004259.g002). Heterogeneous network edge prediction methodology. A) We constructed the network according to a schema, called a metagraph, which is composed of metanodes (node types) and metaedges (edge types). B) The network topology connecting a gene and disease node is measured along metapaths (types of paths). Starting on Gene and ending on Disease, all metapaths of length three or less are computed by traversing the metagraph. C) A hypothetical graph subset showing select nodes and edges surrounding IRF1 and multiple sclerosis. To characterize this relationship, features are computed that measure the prevalence of a specific metapath between IRF1 and multiple sclerosis. D) Two features (for the GeTlD and GiGaD metapaths) are calculated to describe the relationship between IRF1 and multiple sclerosis. The metric underlying the features is degree-weighted path count (DWPC). First, for the specified metapath, all paths are extracted from the network. Next, each path receives a path-degree product (PDP) measuring its specificity (calculated from the node degrees along the path, Dpath). This step requires a damping exponent (here w = 0.5), which adjusts how severely high-degree paths are downweighted. Finally, the path-degree products are summed to produce the DWPC.
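The DWPC calculation described in panel D can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The toy graph, the node names other than IRF1 and MS (T1, T2, G2, G3, D2), the edge-kind labels, and the helper names `HetGraph`, `find_paths`, and `dwpc` are all hypothetical. Two further assumptions are made: each degree in Dpath counts only edges of the kind being traversed, and paths that revisit a node are skipped.

```python
from collections import defaultdict
from math import prod


class HetGraph:
    """Toy undirected graph whose edges carry a kind (metaedge label)."""

    def __init__(self):
        # adjacency[kind][node] -> set of neighbors joined by an edge of that kind
        self.adjacency = defaultdict(lambda: defaultdict(set))

    def add_edge(self, kind, u, v):
        self.adjacency[kind][u].add(v)
        self.adjacency[kind][v].add(u)

    def degree(self, kind, node):
        # Assumption: degrees are counted per edge kind.
        return len(self.adjacency[kind][node])


def find_paths(graph, source, target, metapath):
    """Return all node sequences from source to target that follow metapath."""
    partial = [[source]]
    for kind in metapath:
        partial = [
            path + [nxt]
            for path in partial
            for nxt in graph.adjacency[kind][path[-1]]
            if nxt not in path  # assumption: no repeated nodes within a path
        ]
    return [path for path in partial if path[-1] == target]


def dwpc(graph, source, target, metapath, w=0.5):
    """Sum of path-degree products (PDPs) over all paths of the metapath."""
    total = 0.0
    for path in find_paths(graph, source, target, metapath):
        degrees = []  # Dpath: source and target degree for every edge on the path
        for kind, u, v in zip(metapath, path, path[1:]):
            degrees += [graph.degree(kind, u), graph.degree(kind, v)]
        total += prod(d ** -w for d in degrees)  # PDP of this path
    return total


# Hypothetical toy data; only IRF1 and MS are names taken from the caption.
graph = HetGraph()
for gene, tissue in [("IRF1", "T1"), ("IRF1", "T2"), ("G2", "T1")]:
    graph.add_edge("expression", gene, tissue)
for tissue, disease in [("T1", "MS"), ("T2", "D2")]:
    graph.add_edge("localization", tissue, disease)
for a, b in [("IRF1", "G2"), ("IRF1", "G3")]:
    graph.add_edge("interaction", a, b)
for gene, disease in [("G2", "MS"), ("G3", "MS"), ("G3", "D2")]:
    graph.add_edge("association", gene, disease)

GeTlD = ["expression", "localization"]
GiGaD = ["interaction", "association"]
getld_score = dwpc(graph, "IRF1", "MS", GeTlD)
gigad_score = dwpc(graph, "IRF1", "MS", GiGaD)
print(getld_score, gigad_score)
```

In this toy graph the GiGaD feature sums two paths. The path through G3 contributes less than the path through G2 because G3 has two association edges, which illustrates how the damping exponent downweights paths through higher-degree nodes.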

Mentions: Using publicly-available databases and standardized vocabularies, we constructed a heterogeneous network with 40,343 nodes and 1,608,168 edges (Fig 1, S2 Data). Databases were selected based on quality, reusability, throughput, and their aggregate ability to provide a diversified, multiscale portrayal of biology (see Resource selection in Methods). The network was designed to encode entities and relationships relevant to pathogenesis. The network contained 18 node types (metanodes) and 19 edge types (metaedges), displayed in Fig 2A. Entities represented by metanodes consisted of diseases, genes, tissues, pathophysiologies, and gene sets for 14 MSigDB collections [36,37] including pathways [38,39], perturbation signatures, motifs [40,41], and Gene Ontology (GO) domains [42] (Table 1). Relationships represented by metaedges consisted of gene-disease association, disease pathophysiology, disease localization, tissue-specific gene expression, protein interaction, and gene-set membership for each MSigDB collection (Table 2).
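As a rough illustration of how a schema of metanodes and metaedges yields candidate metapaths from Gene to Disease, the sketch below enumerates them on a reduced metagraph. It is a sketch under stated assumptions rather than the published schema: it uses a single hypothetical `GeneSet` metanode in place of the 14 MSigDB collections, the metaedge endpoints are inferred from the relationship names in the text, and the names `METAEDGES`, `neighbors`, and `metapaths` are invented for this example.

```python
# Reduced, hypothetical metagraph: (edge kind, metanode, metanode).
METAEDGES = [
    ("association", "Gene", "Disease"),
    ("interaction", "Gene", "Gene"),
    ("expression", "Gene", "Tissue"),
    ("localization", "Tissue", "Disease"),
    ("pathophysiology", "Disease", "Pathophysiology"),
    ("membership", "Gene", "GeneSet"),  # stand-in for the MSigDB collections
]


def neighbors(metanode):
    """Yield (edge kind, adjacent metanode) pairs; metaedges are undirected."""
    for kind, a, b in METAEDGES:
        if a == metanode:
            yield kind, b
        if b == metanode and a != b:
            yield kind, a


def metapaths(source, target, max_length):
    """List every metapath (tuple of edge kinds) from source to target."""
    found = []
    frontier = [((), source)]  # (edge kinds walked so far, current metanode)
    for _ in range(max_length):
        next_frontier = []
        for kinds, node in frontier:
            for kind, nxt in neighbors(node):
                extended = kinds + (kind,)
                next_frontier.append((extended, nxt))
                if nxt == target:
                    found.append(extended)
        frontier = next_frontier
    return found


gene_to_disease = metapaths("Gene", "Disease", max_length=3)
for path in gene_to_disease:
    print(" -> ".join(path))
```

Each tuple of edge kinds identifies one metapath because, starting from a fixed metanode, an edge kind determines the next metanode. Whether every enumerated metapath would be used as a model feature is a separate modelling decision that this sketch does not address.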

