Improved predictions of transcription factor binding sites using physicochemical features of DNA.

Maienschein-Cline M, Dinner AR, Hlavacek WS, Mu F - Nucleic Acids Res. (2012)

Bottom Line: New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features.

Affiliation: Department of Chemistry, University of Chicago, Chicago, IL 60637, USA.

ABSTRACT
Typical approaches for predicting transcription factor binding sites (TFBSs) involve use of a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features. The accuracy of each of the variant methods considered was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
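
For readers unfamiliar with the PWM baseline mentioned above, the following is a minimal sketch of log-odds PWM scoring. It is not the authors' implementation: the function names `build_pwm` and `score`, the uniform background, the pseudocount of 1, and the toy sites are all assumptions chosen for illustration, and the sites are made-up sequences, not RegulonDB entries.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    Assumptions (not from the paper): uniform background frequencies and a
    pseudocount of 1 per letter to avoid log(0).
    """
    background = background or {b: 0.25 for b in ALPHABET}
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in ALPHABET
        })
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds weights for a candidate sequence."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Hypothetical toy binding sites, for illustration only.
toy_sites = ["TTGACA", "TTGACT", "TTTACA", "TTGACA"]
toy_pwm = build_pwm(toy_sites)
print(score(toy_pwm, "TTGACA"), score(toy_pwm, "CCCGGG"))
```

A sequence resembling the known sites scores higher than an unrelated one. A PWM treats positions independently and sees only letters, which is the limitation the physicochemical features are meant to address.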

Figure 1. Schematic of amino acid–DNA interaction. The DNA structure for sequence CAG (and two flanking GC pairs on either end) is superimposed with 108 grid points around the center adenosine nucleotide arranged in a 6 × 6 × 3 grid and a sample isoleucine structure; the three sub-grids in the DNA minor groove, major groove and outside the DNA are colored orange, red and purple, respectively. G is colored green, C blue, A yellow and T red. In the calculations, the α-C of the amino acid is centered at each grid point and rotations and energy calculations are performed as described in Materials and Methods. The grid is defined by six bounding planes: the bounding planes above and below the grid are centered halfway between the central A and the adjacent nucleotides C and G above and below, respectively, and are parallel to each other as well as to the plane of the rings in adenine. The bounding plane on the left is centered between the A and T base-pairing nucleotides and is perpendicular to the previously defined planes. The bounding plane on the right is placed 20 Å from the plane on the left (and parallel to it). The bounding planes in and out of the page are perpendicular to all previously defined planes and 10 Å in or out from the center of the adenine ring. Thus the volume of the grid is 20 × 20 × D Å³, where D is the distance between adjacent nucleotides, typically about 3.5 Å. This figure was created using VMD (51,52).
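
The grid geometry in the caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the caption gives the box dimensions (20 × 20 × D Å) and the 6 × 6 × 3 layout, but not how points are spaced inside the box or which axis carries the 3 points. The sketch assumes cell-centred, uniformly spaced points, a local frame with its origin at one box corner, and 3 points along the D axis. The name `grid_points` is hypothetical.

```python
from itertools import product

def grid_points(nx=6, ny=6, nz=3, lx=20.0, ly=20.0, lz=3.5):
    """Return cell-centred points of an nx x ny x nz grid in an lx x ly x lz box.

    Lengths are in angstroms. Assumptions (not from the paper): uniform
    spacing, points at cell centres, origin at one corner of the box, and
    lz = D = 3.5 as the typical distance between adjacent nucleotides.
    """
    def centres(n, length):
        step = length / n
        return [(i + 0.5) * step for i in range(n)]

    return list(product(centres(nx, lx), centres(ny, ly), centres(nz, lz)))

points = grid_points()
print(len(points))  # 6 * 6 * 3 = 108 probe positions
```

In the calculations described in the caption, each of these positions would hold the α-carbon of the amino acid probe.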

Initial structures of amino acids capped by an acetyl group at the N-terminus and an N-methylamide group at the C-terminus were obtained using CHARMM34b1 (55). These structures were paired with equilibrated, average structures of DNA 3-mers (with flanking GC dinucleotides), calculated as described above. For each of the 20 × 64 amino acid-DNA duplex pairs, we considered the following spatial configurations of the two molecules. As illustrated in Figure 1, for each of the two central nucleotides in the DNA duplex, we considered a 6 × 6 × 3 grid filling a rectangular box. For each grid, the α-carbon of the amino acid probe was placed at each of the 108 grid points. The initial orientation of the amino acid was arbitrary but consistent across grid points. At each grid point, we considered 81 distinct whole-molecule rotations. Each of these rotations was a composition of a rotation of angle θ ∈ [−π/2, −7π/18, −5π/18, … , π/2] around the x-axis and a rotation of angle ϕ ∈ [−π, −8π/9, −7π/9, … , π] around the z-axis (9 rotations each). At each grid point, we also considered fifteen side-chain rotamers for all amino acid probes except alanine, glycine and proline. Rotamers of an amino acid were generated by rotating the side chain around the bond between the α- and β-carbons (χ1 dihedral; additional rotations around χ2, χ3, etc. were not considered). The angles of rotations were integer multiples of 2π/15 (see Figure S1 of the Supplementary Materials for more details on how the angle increments were chosen). Thus, for each side of a DNA duplex, we considered a total of 2 256 984 probe-duplex configurations (108 × 81 × (17 × 15 + 3)).
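
The configuration count above can be checked with a short sketch, together with one way to compose the two whole-molecule rotations. The counts (108 grid points, 9 × 9 rotations, 15 rotamers, 3 probes without rotamers) come from the text; the function names and the order of composition (x-axis rotation applied first, then z-axis) are assumptions, since the text does not state the order.

```python
import math

GRID_POINTS = 6 * 6 * 3          # 108 probe positions per grid
N_THETA, N_PHI = 9, 9            # 9 x 9 = 81 whole-molecule rotations
N_ROTAMERS = 15                  # chi1 angles, integer multiples of 2*pi/15
N_AMINO_ACIDS = 20
NO_ROTAMER = ("ALA", "GLY", "PRO")  # probes with a single side-chain state

def configurations_per_side():
    """Count probe-duplex configurations for one side of a DNA duplex."""
    with_rotamers = N_AMINO_ACIDS - len(NO_ROTAMER)            # 17
    side_chain_states = with_rotamers * N_ROTAMERS + len(NO_ROTAMER)
    return GRID_POINTS * N_THETA * N_PHI * side_chain_states

def rotate(v, theta, phi):
    """Rotate v by theta about the x-axis, then by phi about the z-axis.

    The order of composition is an assumption made for this sketch.
    """
    x, y, z = v
    y, z = (y * math.cos(theta) - z * math.sin(theta),
            y * math.sin(theta) + z * math.cos(theta))
    x, y = (x * math.cos(phi) - y * math.sin(phi),
            x * math.sin(phi) + y * math.cos(phi))
    return (x, y, z)

rotamer_angles = [k * 2 * math.pi / N_ROTAMERS for k in range(N_ROTAMERS)]
print(configurations_per_side())  # 108 * 81 * 258 = 2256984
```

The term 17 × 15 + 3 sums side-chain states over all 20 probes, so the total of 2 256 984 covers every amino acid for one side of one DNA 3-mer.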

