A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis.

Lombard Z, Park C, Makova KD, Ramsay M - Biol. Direct (2011)

Bottom Line: In parallel, a motif-finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlights an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR.


Affiliation: Division of Human Genetics, School of Pathology, Faculty of Health Sciences, National Health Laboratory Service & University of the Witwatersrand, Johannesburg, 2000, South Africa. zane.lombard@nhls.ac.za

ABSTRACT

Background: Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches--the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, in accompaniment with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known.

Results: The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif-finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA are integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly ranked XLMR candidate genes.

Conclusions: The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlights an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR.


Figure 2: LDA classification success rates for different values of the tuning parameter. (A) All XAR genes were used for training and test sets. (B) All XCR genes were used for training and test sets. Leave-one-out cross-validation was utilized to calculate correct classification rates. Dots indicate optimal values of τ. More detailed information is given in Table 3.

Mentions: Statistical analyses were performed to predict whether, using the set of overrepresented oligomers, genes could be classified as putative XLMR genes or non-XLMR genes. Linear discriminant analysis (LDA [45]) was used as a classifier to distinguish between XLMR and non-XLMR genes, considering XAR and XCR genes separately. Genes were used as the units for classification, and the counts of the overrepresented oligomers identified above, upstream from the TSSs of those genes, were used as the features. Utilizing two distinct sets of training data (genes from either the XAR or the XCR), both LDA classifiers achieved classification accuracy of greater than 80% for both XLMR and non-XLMR classes, except for the 10 kb range (Table 3). The high rates of correct classification imply that classifiers based on counts of overrepresented oligomers effectively capture genomic signals found near XLMR vs. non-XLMR genes. In the case of the 100 kb range, correct classification rates for both classifiers were ≥ 96% for XLMR and non-XLMR classes (Figure 2 and Table 3). All genes tested in XAR with the 100 kb distance were perfectly classified (Table S7 - Additional File 1). For genes in XCR with the 100 kb distance, two genes (out of seven) in the XLMR class and three genes (out of 75) in the non-XLMR class were incorrectly classified. Among these three incorrectly classified genes, one gene (STARD8) was consistently misclassified as an XLMR gene at the 10 kb and 50 kb distances as well, indicating that this gene may be a strong candidate XLMR gene. The other two genes that were misclassified with the 100 kb distance, TRO and NUDT10, were not consistently misclassified across all three distances (Table S7), which lowers our confidence in them being true candidates. In addition, a further seven non-XLMR genes were incorrectly classified as XLMR genes at both the 10 kb and 50 kb distances (ESX1, FGF16, ZC4H2, MAMLD1, ODZ1, PAGE4 and TMEM28).
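
The classification procedure described above can be illustrated with a minimal sketch. It is not the authors' Galaxy tool: it assumes a plain two-class Fisher LDA with equal class priors, restricted to two features, and it does not model the tuning parameter τ. The function names (fit_lda, predict, loocv_rates) and the oligomer counts are hypothetical and invented purely for illustration; they are not data from the paper.

```python
from statistics import mean

XLMR, NON_XLMR = 1, 0  # class labels (hypothetical encoding)


def fit_lda(X, y, ridge=1e-6):
    """Fit a two-class Fisher LDA on two-feature rows.

    Returns (w, threshold); a row x is assigned to class 1 when w.x > threshold.
    Equal class priors are assumed, which is a simplification.
    """
    rows = {c: [x for x, label in zip(X, y) if label == c] for c in (0, 1)}
    mu = {c: [mean(col) for col in zip(*rows[c])] for c in (0, 1)}

    # Pooled within-class covariance matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for c in (0, 1):
        for x in rows[c]:
            d = (x[0] - mu[c][0], x[1] - mu[c][1])
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(X) - 2
    s = [[v / dof for v in row] for row in s]
    s[0][0] += ridge  # tiny ridge keeps the matrix invertible
    s[1][1] += ridge

    # w = S^-1 (mu1 - mu0), using the closed-form inverse of a 2 x 2 matrix.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = (mu[1][0] - mu[0][0], mu[1][1] - mu[0][1])
    w = ((s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
         (s[0][0] * diff[1] - s[1][0] * diff[0]) / det)
    midpoint = [(mu[0][i] + mu[1][i]) / 2 for i in range(2)]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold


def predict(w, threshold, x):
    """Return 1 (XLMR) if the projected point lies above the threshold, else 0."""
    return int(w[0] * x[0] + w[1] * x[1] > threshold)


def loocv_rates(X, y):
    """Leave-one-out cross-validation: per-class fraction classified correctly."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for i in range(len(X)):
        # Train on every gene except gene i, then test on gene i alone.
        w, t = fit_lda(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        total[y[i]] += 1
        correct[y[i]] += int(predict(w, t, X[i]) == y[i])
    return {c: correct[c] / total[c] for c in total}


# Toy data: each row holds the counts of two oligomers upstream of one gene.
# These numbers are invented for illustration only.
X = [(9, 2), (8, 3), (10, 1), (9, 3), (7, 2),            # XLMR genes
     (2, 8), (3, 9), (1, 7), (2, 10), (3, 8), (1, 9)]    # non-XLMR genes
y = [XLMR] * 5 + [NON_XLMR] * 6

rates = loocv_rates(X, y)
print("correct classification rate, XLMR:", rates[XLMR])
print("correct classification rate, non-XLMR:", rates[NON_XLMR])
```

Reporting the rate for each class separately, as in the text, matters when the classes are unbalanced (for example, 7 XLMR vs. 75 non-XLMR genes in the XCR), because a single overall accuracy could hide poor performance on the smaller class.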

