In silico discovery of transcription regulatory elements in Plasmodium falciparum.

Young JA, Johnson JR, Benner C, Yan SF, Chen K, Le Roch KG, Zhou Y, Winzeler EA - BMC Genomics (2008)

Bottom Line: When applied to promoter regions of genes contained within 21 co-expression gene clusters generated from P. falciparum life cycle microarray data using the semi-supervised clustering algorithm Ontology-based Pattern Identification, GEMS identified 34 putative cis-regulatory elements associated with a variety of parasite processes including sexual development, cell invasion, antigenic variation and protein biosynthesis. This GEMS analysis demonstrates that in silico regulatory element discovery can be successfully applied to challenging repeat-sequence-rich, base-biased genomes such as that of P. falciparum. The putative regulatory elements described represent promising candidates for future biological investigation into the underlying transcriptional control mechanisms of gene regulation in malaria parasites.

Affiliation: Department of Cell Biology, ICND 202, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA. jayoung@scripps.edu

ABSTRACT

Background: With the sequence of the Plasmodium falciparum genome and several global mRNA and protein life cycle expression profiling projects now completed, elucidating the underlying networks of transcriptional control important for the progression of the parasite life cycle is highly pertinent to the development of new anti-malarials. To date, relatively little is known regarding the specific mechanisms the parasite employs to regulate gene expression at the mRNA level, with studies of the P. falciparum genome sequence having revealed few cis-regulatory elements and associated transcription factors. Although it is possible the parasite may evoke mechanisms of transcriptional control drastically different from those used by other eukaryotic organisms, the extreme AT-rich nature of P. falciparum intergenic regions (approximately 90% AT) presents significant challenges to in silico cis-regulatory element discovery.

Results: We have developed an algorithm called Gene Enrichment Motif Searching (GEMS) that uses a hypergeometric-based scoring function and a position-weight matrix optimization routine to identify, with high confidence, regulatory elements in the nucleotide-biased and repeat sequence-rich P. falciparum genome. When applied to promoter regions of genes contained within 21 co-expression gene clusters generated from P. falciparum life cycle microarray data using the semi-supervised clustering algorithm Ontology-based Pattern Identification, GEMS identified 34 putative cis-regulatory elements associated with a variety of parasite processes including sexual development, cell invasion, antigenic variation and protein biosynthesis. Among these candidates were novel motifs, as well as many of the elements for which biological experimental evidence already exists in the Plasmodium literature. To provide evidence for the biological relevance of a cell invasion-related element predicted by GEMS, reporter gene and electrophoretic mobility shift assays were conducted.

Conclusion: This GEMS analysis demonstrates that in silico regulatory element discovery can be successfully applied to challenging repeat-sequence-rich, base-biased genomes such as that of P. falciparum. The fact that regulatory elements were predicted from a diverse range of functional gene clusters supports the hypothesis that cis-regulatory elements play a role in the transcriptional control of many P. falciparum biological processes. The putative regulatory elements described represent promising candidates for future biological investigation into the underlying transcriptional control mechanisms of gene regulation in malaria parasites.
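
The Results above describe GEMS as using a hypergeometric-based scoring function. The abstract does not give the exact formula, so the following is only a minimal sketch of a standard hypergeometric enrichment test, assuming the score asks how likely it is that a motif would appear in at least the observed number of cluster promoters by chance. The function name, parameter names and all counts are hypothetical and are not taken from the paper.

```python
from math import comb, log10


def hypergeom_tail(total, with_motif, cluster_size, cluster_hits):
    """P(X >= cluster_hits) when cluster_size promoters are drawn without
    replacement from `total` promoters, `with_motif` of which contain the motif.
    """
    upper = min(with_motif, cluster_size)
    # Count the ways to pick k motif-containing and (cluster_size - k) other promoters.
    favourable = sum(
        comb(with_motif, k) * comb(total - with_motif, cluster_size - k)
        for k in range(cluster_hits, upper + 1)
    )
    return favourable / comb(total, cluster_size)


# Hypothetical counts: 5000 promoters genome-wide, 200 contain the motif,
# and 15 of the 50 promoters in a co-expression cluster contain it.
p = hypergeom_tail(total=5000, with_motif=200, cluster_size=50, cluster_hits=15)
score = log10(p)  # more negative means stronger enrichment
print(f"log10 P = {score:.2f}")
```

Exact integer binomials are used here instead of a statistics library so the sketch has no dependencies; for genome-scale scans a vectorised implementation would likely be preferable.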

Figure 3: GEMS and MDScan random permutation analyses. a) log10 P enrichment score distribution of motif candidates derived from GEMS analysis of the upstream regions of genes contained within the Sexual Development cluster (GO:GNF0004) (red). For comparison, the log10 P enrichment score distribution of motif candidates derived from GEMS analysis of 100 randomly selected sets of promoter sequences is also plotted (blue), representing the p-value range that is obtainable by chance, i.e. potential false positives. log10 P values for the top 20 motifs from GEMS analysis of the Sexual Development cluster all fall below those obtainable by random simulations. b) MAP score distribution of motif candidates derived from MDScan analysis of the Sexual Development cluster (GO:GNF0004) (red). Again, for comparison, the MAP score distribution of motif candidates derived from MDScan analysis of 100 randomly selected sets of promoter sequences is also plotted (blue), representing the MAP score range that is obtainable by chance. MAP scores for motifs obtained using MDScan analysis of the Sexual Development cluster are not clearly separated from those obtained in random simulations, suggesting the potential for many false positives in MDScan motif discovery.

Mentions: For each cluster, the GEMS analysis produced a large number of putative regulatory elements with widely varying p-values. To determine the p-value range most likely to represent true positives, we performed GEMS analyses on 100 independently selected random positive sets of promoter sequences, estimating the likelihood of obtaining any given p-value by chance. These exercises demonstrated that p-values for motifs identified from the random promoter sets were consistently higher than 10^-8, whereas p-values for the top motifs identified using a promoter positive set derived from the Sexual Development OPI cluster (GO:GNF0004) were well below 10^-10 (Figure 3a). These results suggested biological relevance for the top-scoring motifs identified by GEMS using OPI cluster-derived promoter sets, as similar p-values could not be obtained in the permutation simulations, that is, in the absence of biologically based co-expression data. In contrast, when MDScan was applied to the same Sexual Development OPI cluster positive promoter set and to randomly selected positive promoter sets, the separation between scores for motifs identified from the true positive set and from the 100 random positive sets was much less pronounced (Figure 3b). This indicates that the best-scoring motifs identified using MDScan may not necessarily be biologically relevant, as similarly scored motifs can be obtained from randomly selected sets of sequences. Application of MEME to the same cluster resulted in motifs with extremely high E-values or non-significant group-specificity scores.
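
The random-set comparison described above can be sketched in a few lines. This is a simplified illustration, not the GEMS procedure itself: it scores one fixed motif against 100 random promoter sets, whereas the paper reruns the full motif search on each random set. The enrichment score is assumed to be a hypergeometric tail probability, and all gene identifiers, counts and function names are hypothetical toy values.

```python
import random
from math import comb, log10


def log10_enrichment(total, with_motif, set_size, hits):
    """log10 of the hypergeometric tail P(X >= hits) for one motif and one promoter set."""
    upper = min(with_motif, set_size)
    tail = sum(
        comb(with_motif, k) * comb(total - with_motif, set_size - k)
        for k in range(hits, upper + 1)
    )
    p = tail / comb(total, set_size)
    return log10(p) if p > 0 else float("-inf")


def random_set_scores(all_genes, motif_genes, set_size, n_sets=100, seed=0):
    """Score the motif against n_sets randomly drawn promoter sets of equal size."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sets):
        sample = rng.sample(all_genes, set_size)
        hits = sum(1 for g in sample if g in motif_genes)
        scores.append(log10_enrichment(len(all_genes), len(motif_genes), set_size, hits))
    return scores


# Toy data (hypothetical): 2000 genes, 100 of which carry the motif upstream.
all_genes = list(range(2000))
motif_genes = set(range(100))
# A made-up "cluster" of 60 genes, 40 of which carry the motif.
cluster = list(range(40)) + list(range(1000, 1020))

cluster_hits = sum(1 for g in cluster if g in motif_genes)
observed = log10_enrichment(len(all_genes), len(motif_genes), len(cluster), cluster_hits)
null = random_set_scores(all_genes, motif_genes, len(cluster))

# The observed score is only interesting if it falls below everything seen by chance.
print(f"observed log10 P = {observed:.1f}; best random log10 P = {min(null):.1f}")
print("separated from random sets:", observed < min(null))
```

The same comparison applies to any scoring scheme: if the scores from the real cluster overlap the scores from random sets, as reported for MDScan in Figure 3b, the top motifs cannot be distinguished from chance findings.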

