3PFDB+: improved search protocol and update for the identification of representatives of protein sequence domain families.

Joseph AP, Shingate P, Upadhyay AK, Sowdhamini R - Database (Oxford) (2014)

Bottom Line: This improved and updated database, 3PFDB+ generated in this study, provides representative sequences and profiles for PFAM families, with 13 519 family representatives having more than 90% family coverage. Representatives belonging to small families with short sequences are mainly associated with low coverage. Some of these outliers were also predicted to have different secondary structure contents, which reflect different putative structure or functional roles for these domain sequences.


Affiliation: National Centre for Biological Sciences (TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India.

ABSTRACT
Protein domain families are usually classified on the basis of similarity of amino acid sequences. Selection of a single representative sequence for each family provides targets for structure determination or modeling and also enables fast sequence searches to associate new members with a family. Such a selection can be challenging, since these domain families vary hugely in the number of members, the average family sequence length and the extent of sequence divergence within a family. We had earlier created the 3PFDB database as a repository of best representative sequences, selected from each PFAM domain family on the basis of high coverage. In this study, we have improved the database using more efficient strategies for the initial generation of sequence profiles and have implemented two independent methods, FASSM and HMMER, for identifying family members. HMMER employs a global sequence similarity search, while FASSM relies on motif identification and matching. This improved and updated database, 3PFDB+, provides representative sequences and profiles for PFAM families, with 13 519 family representatives having more than 90% family coverage. The representative sequence is also highlighted in a two-dimensional plot, which reflects the relative divergence between family members. Representatives belonging to small families with short sequences are mainly associated with low coverage. The set of sequences not recognized by the family representative profiles highlights several potential false or weak family associations in PFAM. Partial domains and fragments dominate such cases, along with sequences that are highly diverged or different from other family members. Some of these outliers were also predicted to have different secondary structure contents, which reflect different putative structure or functional roles for these domain sequences. Database URL: http://caps.ncbs.res.in/3pfdbplus/.
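
The idea of choosing a representative "on the basis of high coverage" can be illustrated with a minimal sketch. It assumes that coverage means the fraction of known family members recovered by a search started from a candidate's profile; the function names, the toy sequence identifiers and the hit sets below are hypothetical and are not taken from the 3PFDB+ pipeline, which obtains its hits from HMMER and FASSM searches.

```python
def coverage(hits, family):
    """Fraction of family members found among a candidate's search hits."""
    if not family:
        return 0.0
    # Hits outside the family (e.g. "seqX" below) do not count towards coverage.
    return len(hits & family) / len(family)


def best_representative(family, hits_by_candidate):
    """Return (candidate, coverage) for the candidate with the highest coverage."""
    return max(
        ((cand, coverage(hits, family)) for cand, hits in hits_by_candidate.items()),
        key=lambda pair: pair[1],
    )


# Toy family of four members and the hits returned for three candidates.
family = {"seqA", "seqB", "seqC", "seqD"}
hits_by_candidate = {
    "seqA": {"seqA", "seqB", "seqC", "seqD"},
    "seqB": {"seqA", "seqB"},
    "seqC": {"seqC", "seqX"},
}

rep, cov = best_representative(family, hits_by_candidate)
print(rep, cov)
```

In this toy case the profile seeded from seqA recovers every member, so it would be picked; a threshold such as the 90% coverage figure quoted above could then be applied to the returned value.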

Figure 9. Sequences not recognized by representatives. The distribution of partial domains, low complexity sequences, compositionally biased sequences and diverged family members among those sequences not recognized as family members by the representatives. The family properties of those sequences not belonging to these categories are also highlighted.
© Copyright Policy - creative-commons

Mentions: Sequences whose failure to be recognized as part of the family could not be attributed to any of the above reasons comprise 31.4% of the sequences (Figure 9), and they span 270 families. The family properties of these sequences were analysed. Of these, 244 sequences belong to families with an average sequence length of <50 residues and 442 sequences belong to families with >10 000 members. Below the 50% sequence identity level, the average family sequence identity is <15% for 97% of the sequences, as they belong to highly diverse PFAM families. The clans covering the maximum number of these sequences include C2H2/C2HC zinc fingers, Tetratricopeptide repeats, OB fold, Ankyrin repeats and P-loop NTPases. Most of these clans are known for short and diverse sequences. The reliability of family association for these sequences needs to be verified further, and the representative profiles need to be enriched with more sequences for genuine cases of family membership. The list of sequences for which we could associate reasons for not being recognized by the family BRP can be downloaded from the database, and Supplementary Table S5 lists the number of sequences and the reasons attributed for weak family association.
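
The two family-level thresholds mentioned above (average family sequence length <50 residues, family size >10 000) can be applied with a small helper such as the one below. This is only an illustrative sketch: the function name, the flag labels and the example values are hypothetical, and the authors' own analysis scripts are not described in the text.

```python
SHORT_FAMILY_MAX_LEN = 50        # average family sequence length, in residues
LARGE_FAMILY_MIN_SIZE = 10_000   # number of members in the family


def family_flags(avg_length, family_size):
    """Return the family properties that may explain a missed sequence."""
    flags = []
    if avg_length < SHORT_FAMILY_MAX_LEN:
        flags.append("short_family")
    if family_size > LARGE_FAMILY_MIN_SIZE:
        flags.append("large_family")
    return flags


# Hypothetical example: a short, very large repeat-like family.
print(family_flags(avg_length=23.5, family_size=15_000))
# A family of ordinary length and size gets no flags.
print(family_flags(avg_length=120.0, family_size=300))
```

A sequence can carry both flags at once, which is why the counts of 244 and 442 quoted above need not be disjoint.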

