Re-visiting protein-centric two-tier classification of existing DNA-protein complexes.

Malhotra S, Sowdhamini R - BMC Bioinformatics (2012)

Bottom Line: Our results suggest that, with the growing number of DNA-protein complexes available in the Protein Data Bank, the number of families in the classification has increased approximately threefold. The proposed revisited classification can be used for genome-wide surveys of DNA-binding proteins in genomes of interest. Further analysis of these complexes can aid in developing algorithms that identify DNA-binding proteins and their family members from sequence information alone.

Affiliation: National Centre for Biological Sciences, UAS-GKVK Campus, Bangalore 560 065, India.

ABSTRACT

Background: Precise DNA-protein interactions play a vital role in maintaining the normal physiological functioning of the cell, as they control many high-fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. In 2000, a systematic classification of DNA-protein complexes based on structural analysis of the proteins was proposed at two tiers, namely groups and families. With the growth in the number and resolution of DNA-protein complex structures deposited in the Protein Data Bank, it is important to revisit the existing classification.

Results: On the basis of sequence analysis of DNA-binding proteins, we have built upon the protein-centric, two-tier classification of DNA-protein complexes by adding new members to existing families and creating new families and groups. Classifying the new complexes also revealed the emergence of new groups and families. The new group observed is one in which a β-propeller interacts with DNA. Thirty-four SCOP folds are present in the complexes of both the old and new classifications, whereas 28 folds are present exclusively in the new complexes. New families include NarL transcription factor, Z-α DNA-binding proteins, Forkhead transcription factor, AP2 protein and methyl-CpG-binding protein, among others.

Conclusions: Our results suggest that, with the growing number of DNA-protein complexes available in the Protein Data Bank, the number of families in the classification has increased approximately threefold. The folds present exclusively in the newly classified complexes suggest that the new classification includes proteins with new functions; the most populated of these folds are those responsible for DNA damage repair. The proposed revisited classification can be used for genome-wide surveys of DNA-binding proteins in genomes of interest. Further analysis of these complexes can aid in developing algorithms that identify DNA-binding proteins and their family members from sequence information alone.

Figure 10: Pairwise percentage identity distribution for new families.

(A) Boxplot of the pairwise percent identity distribution for 18 multi-member families of group HTH. Family names: 1. *Cro and Cro like 2. *Homeodomain 3. LacI repressor 4. Hin recombinase 5. *RAP1 6. *Iron Dependent repressor 7. *Transcription factor IIB/IIA 8. NarL transcription factor 9. *Tn5 Transposase 10. *MutS 11. *Tetracycline Repressor 12. Interferon regulatory factor 13. *Catabolite gene activator protein 14. Transcription factor 15. *Ets domain 16. *Z-α domain 17. *Forkhead TF 18. Transcription activator BmrR

(B) Boxplot of the pairwise percent identity distribution for 20* multi-member families of groups Zinc co-ordinating, Zipper type, Other α-helix, β-sheet, β-hairpin/ribbon and Other. Family names: 1. *Zinc-coordinating-β-β-α-zinc finger 2. *Zinc-coordinating-Nuclear Receptors 3. *Zinc-coordinating-Loop-sheet-helix 4. *Zinc-coordinating-gal4 5. *Zipper-type-bzip1 6. *Zipper-type-bzip2 7. *Zipper-type-Helix-loop-helix 8. *Other α-helix-histone 9. *Other α-helix-Site specific recombinases 10. *Other α-helix-High mobility group 11. Other α-helix-MADS box 12. *Other α-helix-CUT domain 13. *β-sheet-TATA box-binding 14. *β-hairpin-MetJ repressor 15. β-hairpin-Tus replication terminator 16. *β-hairpin-Integration host factor 17. *β-hairpin-Hyperthermophile DNA-BP 18. *β-hairpin-Arc repressor 19. β-hairpin-Omega Repressor 20. *β-hairpin-SRA Domain 21. *Other-Ig fold like TF 22. *Other-SeqA (*The total number of plots in panel (B) is 21, as bzip1 and bzip2 have separate boxplots but are subfamilies of the single Leucine zipper family.)

(C) Boxplot of the pairwise percent identity distribution for 43 multi-member families of group enzymes. Family names: 1. Methyltransferase 2. Endonuclease PvuII 3. Endonuclease EcoRV 4. *Endonuclease EcoRI 5. Endonuclease BamHI 6. *Endonuclease V 7. *DNase I 8. *HIV reverse transcriptase 9. *Uracil-DNA glycosylase 10. 3-Methyladenine DNA glycosylase 11. *Homing endonuclease 12. *Topoisomerase I 13. T7 RNA Pol 14. N4 RNA Pol 15. HincII restriction endonuclease 16. *Endonuclease III and MutY 17. *DNA Photolyase 18. α-glucosyl transferase 19. *Helicase 20. *Thymine DNA-glycosylase 21. 8-oxoguanine DNA glycosylase 22. AlkA 23. Phi 6 RNA Pol 24. β-Glucosyltransferase 25. *Endonuclease VIII and MutM 26. Human tyrosyl-DNA phosphodiesterase 27. Relaxase TrwC 28. *Nuclease-Colicin 29. Endonuclease IV 30. Excisionase (Xis) 31. ISHp608 Transposase 32. AlkB 33. Restriction Endonuclease HinP1I 34. ABH2 35. Restriction endonuclease SgrAI 36. *Family A Polymerases 37. *Family B Polymerases 38. *Family X Polymerases 39. *Family Y Polymerases 40. *DNA Ligase 41. *Family C polymerases 42. Mtaq 1 methylase 43. *DAM

A star in front of a family name indicates that the family has a wide distribution of percent identity and was further subjected to jack-knifing to select the representative.
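The distributions summarised in these boxplots can be illustrated with a minimal sketch. It assumes the family members are already aligned to a common length with `-` for gaps, and that percent identity is counted over columns where neither sequence has a gap; the source does not state which definition or alignment tool the authors used. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import quantiles


def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap (assumed definition)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def pairwise_identities(aligned):
    """All-against-all percent identities for one family (dict of name -> aligned sequence)."""
    return [
        percent_identity(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    ]


def box_summary(values):
    """Five-number summary of the kind a boxplot displays."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


# Toy three-member family (made-up sequences, not from the paper).
family = {"m1": "ACDEFGHIKL", "m2": "ACDEFGHIKV", "m3": "ACDQ-GHIRL"}
identities = pairwise_identities(family)
summary = box_summary(identities)
```

A family whose summary spans a large range between `min` and `max` would correspond to the "wide" (starred) case; where to draw that line is not specified in the source.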

Mentions: Across all groups there were, in total, 57 single-member, 35 two-member and 82 multi-member families, and new representatives were selected for these 174 new families. For each of the 57 single-member families, the sole member served as the representative. In two-member families, each member was given an equal chance of becoming the representative, and the one with 100% coverage was selected. For multi-member families, the pairwise percentage identity distribution is shown as box plots in Figure 10. Of the 82 multi-member families, 32 had a narrow percent identity distribution, whereas 50 families (marked with a star in Figure 10) had a wide distribution. For 47 of these families, a leave-one-out approach was adopted to find the best representative. The remaining three families with a wide percent identity distribution had more than 50 members each (Family A DNA Pol, Family X DNA Pol and Family Y DNA Pol); for these, clustering was performed, and the representative from every cluster was then assessed for coverage both individually and in combination.
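The leave-one-out selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: a candidate is taken to "cover" another member when their pairwise percent identity reaches a hypothetical threshold (30% here), standing in for whatever search method was actually used to assess coverage. The names `coverage` and `best_representative` and the identity values are made up.

```python
def coverage(candidate, members, identity, threshold=30.0):
    """Percentage of the other family members that the candidate covers.

    `identity` maps frozenset({a, b}) -> percent identity; `threshold` is an
    assumed cut-off, not a value taken from the source.
    """
    others = [m for m in members if m != candidate]
    if not others:
        return 100.0  # a single-member family represents itself
    hits = sum(identity[frozenset((candidate, m))] >= threshold for m in others)
    return 100.0 * hits / len(others)


def best_representative(members, identity, threshold=30.0):
    """Leave each member out in turn as the query and keep the one with the highest coverage."""
    # Sorting first makes tie-breaking deterministic (alphabetical).
    return max(sorted(members), key=lambda m: coverage(m, members, identity, threshold))


# Toy four-member family with made-up pairwise identities.
members = ["a", "b", "c", "d"]
identity = {
    frozenset(("a", "b")): 80.0,
    frozenset(("a", "c")): 25.0,
    frozenset(("a", "d")): 20.0,
    frozenset(("b", "c")): 45.0,
    frozenset(("b", "d")): 35.0,
    frozenset(("c", "d")): 28.0,
}
representative = best_representative(members, identity)
```

In this toy family only member `b` reaches the threshold against all three other members, so it is selected. For the three large polymerase families, the same coverage function could in principle be applied to one candidate per cluster, though the clustering step itself is not sketched here.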

