Data Mining and Pattern Recognition Models for Identifying Inherited Diseases: Challenges and Implications

View Article: PubMed Central - PubMed

ABSTRACT

Data mining and pattern recognition methods reveal interesting findings in genetic studies, especially on how genetic makeup is associated with inherited diseases. Although researchers have proposed various data mining models for biomedical applications, accurately prioritizing the single nucleotide polymorphisms (SNPs) associated with a disease remains a challenge. In this commentary, we review the state-of-the-art data mining and pattern recognition models for identifying inherited diseases and deliberate on the need for binary classification- and scoring-based prioritization methods in determining causal variants. While we discuss the pros and cons of the known methods, we argue that gene prioritization methods and protein-protein interaction (PPI) methods, in conjunction with the K-nearest neighbors (KNN) algorithm, could be used to accurately categorize the genetic factors in disease causation.



Figure 4: Gene prioritization methods. (A) (a) Entities and their relations are retrieved from the source databases and are then filtered, sorted and disambiguated to create the graph. A list of genes for prioritization is obtained from gene expression studies, positional studies or GO (Arrais and Oliveira, 2010); (b) Calculation of the gene semantic similarity scores to obtain the three gene similarity profiles; (B) Combining the three gene similarity profiles and the phenotype similarity profile to calculate three-dimensional features for classification; (C) Training the classifier using the known associations and sufficient unrelated gene-disease pairs as training sets; (D) Prediction of disease-gene associations; (E) Prioritization of candidate genes (He and Jiang, 2012).

The availability of human genome-wide PPI data has opened a wide outlook for discovering inherited disease genes by studying topological features in PPI networks (Xu and Li, 2006). Studies of proteins and their interactions are important for understanding their dynamic roles in inherited and, to a certain extent, rare immunological disorders. Mapping disease-specific genes (for example, genes related to schizophrenia) onto the whole human interaction network and then extracting the related subnetworks can therefore throw light on the cellular mechanisms and biological processes related to the inherited disease (Jingchun et al., 2008). Sequence-based features have previously been exploited, and in many cases there are significant differences between genes responsible for human hereditary disease and those not known to be involved in disease (Xu and Li, 2006). Genes associated with a human disease preferentially interact with other disease-causing genes, suggesting that heritable disease genes might share some topological features in the PPI network when compared to non-disease genes (Gandhi et al., 2006). We further argue that similar genes can be obtained from the human PPI datasets of the Online Predicted Human Interaction Database (OPHID; Brown and Jurisica, 2005) by employing the K-nearest neighbor (KNN) algorithm for classification. The KNN algorithm is a simple yet powerful non-parametric classification algorithm (Franke et al., 2006) with effective performance. With rapid improvement in the quantity and quality of human interaction and phenotypic data, the performance and utility of this approach for detecting novel disease genes should improve further as we come to the end of the post-genomic era.
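The idea of classifying genes by their topological features in a PPI network can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gene names, the toy network, the two features (fraction of neighbors that are known disease genes, and degree scaled by the maximum degree) and the choice of k = 3 are hypothetical and are not taken from the cited studies.

```python
from collections import Counter
from math import dist

# Hypothetical toy PPI network: undirected edges between made-up gene names.
EDGES = [
    ("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4"),
    ("G4", "G5"), ("G5", "G6"), ("G6", "G7"), ("G7", "G8"),
    ("G2", "G9"), ("G8", "G10"),
]
KNOWN_DISEASE = {"G1", "G2", "G3"}       # assumed labeled disease genes
KNOWN_NON_DISEASE = {"G6", "G7", "G8"}   # assumed labeled non-disease genes


def build_adjacency(edges):
    """Turn an edge list into a dict mapping each gene to its neighbor set."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def topological_features(gene, adj, disease_genes, max_degree):
    """Return (fraction of disease-gene neighbors, degree / max_degree)."""
    neighbors = adj[gene]
    disease_fraction = len(neighbors & disease_genes) / len(neighbors)
    return (disease_fraction, len(neighbors) / max_degree)


def knn_predict(query, training, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


adj = build_adjacency(EDGES)
max_degree = max(len(neighbors) for neighbors in adj.values())


def features(gene):
    return topological_features(gene, adj, KNOWN_DISEASE, max_degree)


training = [(features(g), "disease") for g in sorted(KNOWN_DISEASE)]
training += [(features(g), "non-disease") for g in sorted(KNOWN_NON_DISEASE)]

# Classify two unlabeled candidate genes from the toy network.
predictions = {g: knn_predict(features(g), training) for g in ("G9", "G10")}
print(predictions)
```

In this toy network the candidate that interacts with a known disease gene is grouped with the disease genes, which is the behavior the preferential-interaction observation would suggest. A real analysis would need many more features, cross-validation and care to avoid using a gene's own label when computing its neighbors' features.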
We have earlier proposed a classification scoring method to validate the interaction mapping between such proteins and calculated the total reliability score (TRS) using machine learning algorithms (Suravajhala and Sundararajan, 2012). The accuracy of the six-point classification model was found to be 81.08% with a multilayer perceptron neural network, which, if used with such approaches, could have identified causal SNPs toward the development of molecular markers. In the recent past, PPI databases have also been used as a knowledge base, in which a set of known disease-related genes is combined with linkage analysis to prioritize the best candidates. From our previous discussions, we argue that the main advantage of the probabilistic knowledge model is that it reduces the prioritization error by 6% when compared to already published methods. These studies rely on relatedness to known diseases or closely related disease processes, however, so prioritizing loner or isolated genes, which have no known relationships to other nodes in the network, remains a challenge. A solution for prioritizing disease genes that are only loosely connected to other genes was proposed by Fang et al. (2014) (Figure 4), which uses network diffusion and rank concordance (NDRC). In addition, they found that genes related to complex diseases are divided into several modules associated with different disease phenotypes. Their first step built the network without removing the insignificant genes, while the second was based on the Diffusion Rank (DR) algorithm (Yang et al., 2007). NDRC simulates a heat diffusion process in which information flows from the known disease genes of related diseases and propagates over the PPI network, addressing the problem of prioritizing disease genes from noisy data (Li et al., 2009; Wang et al., 2009; Fang et al., 2014).
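The propagation of information from known disease genes over a PPI network can be sketched with a random walk with restart, a commonly used diffusion-style scheme. This is not the NDRC method itself: the toy network, the gene names, the restart probability of 0.3 and the fixed iteration count are all assumptions chosen only to show how scores decay with network distance from the seed genes.

```python
# Hypothetical toy PPI network; SEED1 and SEED2 stand in for known disease genes.
EDGES = [
    ("SEED1", "A"), ("SEED2", "A"), ("A", "B"), ("B", "C"), ("C", "D"),
]
SEEDS = {"SEED1", "SEED2"}


def build_adjacency(edges):
    """Turn an edge list into a dict mapping each gene to its neighbor set."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def diffuse(adj, seeds, restart=0.3, iterations=50):
    """Random walk with restart: at every step each gene passes a share of its
    score equally to its neighbors, and a fraction `restart` of the total
    score is re-injected at the seed genes."""
    genes = sorted(adj)
    start = {g: (1.0 / len(seeds) if g in seeds else 0.0) for g in genes}
    score = dict(start)
    for _ in range(iterations):
        updated = {g: restart * start[g] for g in genes}
        for g in genes:
            share = (1.0 - restart) * score[g] / len(adj[g])
            for neighbor in adj[g]:
                updated[neighbor] += share
        score = updated
    return score


adj = build_adjacency(EDGES)
scores = diffuse(adj, SEEDS)

# Rank the non-seed candidates by how much score reached them.
ranked = [g for g in sorted(scores, key=scores.get, reverse=True) if g not in SEEDS]
print(ranked)
```

Candidates closer to the seed genes receive more of the propagated score, so even a gene with a single interaction partner obtains a non-zero score as long as it is reachable from a seed.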
Multiple kernel learning (MKL) and N-dimensional order statistics (NDOS) methods were found to handle noisy data effectively. Two strategies are relatively well known. The first searches for a kernel that best represents all the available information using a convex optimization method known as semi-definite programming (SDP; Lanckriet et al., 2004; De Bie et al., 2007), which simultaneously optimizes the parameters of a one-class SVM, with the test genes then ranked by that one-class SVM. In the second strategy, all test genes are first ranked using a one-class SVM on each individual data source, and N-dimensional order statistics are then used to combine these rank lists into one (Lanckriet et al., 2004). For gene ontology and sequence data, we argue that the kernel with the higher weight may not have much influence in discerning the SNPs. The different machine learning algorithms described here have been used for prioritizing SNPs and have proven effective, especially SVMs, which perform better than random forests. A comparative study of prioritization tools was carried out by Tranchevent et al. (2010). The authors recommended Genes to Diseases (G2D) as a good tool for providing an ordered list of candidate genes in the peak regions (Perez-Iratxeta et al., 2002). In conclusion, SVMs can be applied to gene prioritization with ontology association, as this encompasses enrichment-based feature selection in the identification of inherited diseases (He and Jiang, 2012; Xie et al., 2012). Nevertheless, feature selection and the integration of more data sources, such as semantic and phenotype similarity profiles between diseases and genes, could improve predictive ability.
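The second strategy, combining per-source rank lists with order statistics, can be sketched as follows. The sketch assumes one standard formulation: each gene's ranks are converted to rank ratios, and a Q statistic is computed as the probability that uniformly random rank ratios would all be at least as good, using the recurrence V_k = sum over i of (-1)^(i-1) V_(k-i) r_(N-k+1)^i / i! with Q = N! V_N. The three rank lists and the gene names are hypothetical, and the per-source one-class SVM ranking is taken as already done.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Joint order-statistic score for one gene.

    `rank_ratios` holds the gene's rank divided by the list length for each
    data source. A smaller result means the gene is ranked near the top more
    consistently than chance would explain.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        x = r[n - k]  # r_(N-k+1) in 1-based notation
        v.append(sum((-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]


def fuse_rankings(rankings):
    """Combine several complete rank lists of the same genes into one."""
    genes = rankings[0]
    size = len(genes)
    q = {
        g: q_statistic([(ranking.index(g) + 1) / size for ranking in rankings])
        for g in genes
    }
    return sorted(genes, key=lambda g: q[g]), q


# Hypothetical rank lists from three data sources, best candidate first.
rankings = [
    ["GA", "GB", "GC", "GD"],
    ["GA", "GC", "GB", "GD"],
    ["GB", "GA", "GC", "GD"],
]
fused, q_values = fuse_rankings(rankings)
print(fused)
```

A gene that is ranked last by every source receives Q = 1, while a gene that is consistently near the top receives a small Q and moves to the front of the fused list.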