Machine learning for regulatory analysis and transcription factor target prediction in yeast.

Holloway DT, Kon M, Delisi C - Syst Synth Biol (2007)

Bottom Line: For the purpose of illustration we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors which have available binding data.


Affiliation: Molecular Biology Cell Biology and Biochemistry, Boston University, Boston, MA, 02215, USA, dth128@bu.edu.

ABSTRACT
High-throughput technologies, including array-based chromatin immunoprecipitation, have rapidly increased our knowledge of transcriptional maps: the identity and location of regulatory binding sites within genomes. Still, the full identification of sites, even in lower eukaryotes, remains largely incomplete. In this paper we develop a supervised learning approach to site identification using support vector machines (SVMs) to combine 26 different data types. A comparison with the standard approach to site identification using position specific scoring matrices (PSSMs) for a set of 104 Saccharomyces cerevisiae regulators indicates that our SVM-based target classification is more sensitive (73 vs. 20%) when specificity and positive predictive value are the same. We have applied our SVM classifier for each transcriptional regulator to all promoters in the yeast genome to obtain thousands of new targets, which are currently being analyzed and refined to limit the risk of classifier over-fitting. For the purpose of illustration we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. For both transcription factors, SVM predictions match well with the known biology of control mechanisms, and possible new roles for these factors are suggested, such as a function for Rap1 in regulating fermentative growth. We also examine the promoter melting temperature curves for the targets of YJR060W, and show that targets of this TF have potentially unique physical properties which distinguish them from other genes. The SVM output automatically provides the means to rank dataset features to identify important biological elements. We use this property to rank classifying k-mers, thereby reconstructing known binding sites for several TFs, and to rank expression experiments, determining the conditions under which Fhl1, the factor responsible for expression of ribosomal protein genes, is active. We can see that targets of Fhl1 are differentially expressed in the chosen conditions as compared to the expression of average and negative set genes. SVM-based classifiers provide a robust framework for analysis of regulatory networks. Processing of classifier outputs can provide high quality predictions and biological insight into functions of particular transcription factors. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors which have available binding data.


Fig. 1: Flow diagram: synthesizing a single classifier for each TF from several data sets. A classifier is constructed for each individual TF for each genomic dataset, using every one of four possible kernel functions (26 datasets × 104 TFs × 4 kernel functions = 10,816 kernels from which SVM classifiers are built). For each of these classifiers optimal parameters are chosen by cross-validation. For each dataset and each TF, the best performing of the four kernel functions is selected, reducing the number of classifiers to 2,704 (26 datasets × 104 TFs). Finally, the datasets are combined based on the F1 score of their best performing kernel so that there is only one classifier per TF.
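The counting and the per-pair kernel selection described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name best_kernel_per_pair, the integer TF and dataset labels, and the random placeholder F1 scores are all assumptions made for the example, and the final step of combining datasets into one classifier per TF is left out because the caption does not specify how it is done.

```python
import random
from itertools import product

N_DATASETS = 26
N_TFS = 104
# Kernel names as listed in the source text; the labels themselves are placeholders here.
KERNELS = ("linear", "rbf", "gaussian", "polynomial")


def best_kernel_per_pair(f1_scores):
    """Keep, for each (tf, dataset) pair, the kernel with the highest F1 score.

    f1_scores maps (tf, dataset, kernel) -> F1 score.
    Returns a dict mapping (tf, dataset) -> (kernel, f1).
    """
    best = {}
    for (tf, dataset, kernel), f1 in f1_scores.items():
        key = (tf, dataset)
        if key not in best or f1 > best[key][1]:
            best[key] = (kernel, f1)
    return best


# Hypothetical scores: random numbers stand in for cross-validated F1 values.
rng = random.Random(0)
scores = {
    (tf, dataset, kernel): rng.random()
    for tf, dataset, kernel in product(range(N_TFS), range(N_DATASETS), KERNELS)
}

best = best_kernel_per_pair(scores)
print(len(scores))  # 26 * 104 * 4 candidate classifiers
print(len(best))    # 26 * 104 classifiers after kernel selection
```

With real data, each score would come from cross-validating one SVM per TF, dataset and kernel; only the bookkeeping is shown here.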

Mentions: A separate classifier is developed for each TF based on each independent dataset. The four kernel functions in Table 1 (linear, rbf, Gaussian, and polynomial) are tested using leave-one-out cross-validation, and the function with the highest F1 score (below) is chosen as best for that particular TF–dataset combination. A flow diagram of our method can be seen in Fig. 1. Let TP denote the count of true positives, FN false negatives, etc. The F1 statistic is a robust measure that represents the harmonic mean of sensitivity (S) and positive predictive value (PPV). It is defined by

$$F_1 = \frac{2 \times S \times \mathrm{PPV}}{S + \mathrm{PPV}} = \frac{2 \times \mathrm{TP}}{2 \times \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$
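As a quick check of the F1 definition, the sketch below computes the statistic both ways, from sensitivity and PPV and directly from the counts, and confirms that the two forms agree. The function names and the example counts (TP = 30, FP = 10, FN = 20) are illustrative assumptions, not values from the paper; returning 0.0 when there are no true positives is also a convention chosen for this example.

```python
import math


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_from_rates(tp, fp, fn):
    """F1 as the harmonic mean of sensitivity S and positive predictive value PPV."""
    if tp == 0:
        return 0.0  # S and PPV are both zero (or undefined) without true positives
    s = tp / (tp + fn)    # sensitivity: fraction of true targets recovered
    ppv = tp / (tp + fp)  # positive predictive value: fraction of predictions that are correct
    return 2 * s * ppv / (s + ppv)


# Hypothetical counts for one TF-dataset classifier.
tp, fp, fn = 30, 10, 20
a = f1_from_counts(tp, fp, fn)
b = f1_from_rates(tp, fp, fn)
print(round(a, 4), round(b, 4))
assert math.isclose(a, b)
```

Here S = 0.6 and PPV = 0.75, so both forms give F1 = 60/90, about 0.667. The count-based form avoids the intermediate divisions and is the safer one to use when a classifier predicts no positives.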

