3PFDB+: improved search protocol and update for the identification of representatives of protein sequence domain families.

Joseph AP, Shingate P, Upadhyay AK, Sowdhamini R - Database (Oxford) (2014)

Bottom Line: This improved and updated database, 3PFDB+ generated in this study, provides representative sequences and profiles for PFAM families, with 13 519 family representatives having more than 90% family coverage. Representatives belonging to small families with short sequences are mainly associated with low coverage. Some of these outliers were also predicted to have different secondary structure contents, which reflect different putative structure or functional roles for these domain sequences.


Affiliation: National Centre for Biological Sciences (TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India.

ABSTRACT
Protein domain families are usually classified on the basis of similarity of amino acid sequences. Selection of a single representative sequence for each family provides targets for structure determination or modeling and also enables fast sequence searches to associate new members to a family. Such a selection could be challenging since some of these domain families exhibit huge variation depending on the number of members in the family, the average family sequence length or the extent of sequence divergence within a family. We had earlier created the 3PFDB database as a repository of best representative sequences, selected from each PFAM domain family on the basis of high coverage. In this study, we have improved the database using more efficient strategies for the initial generation of sequence profiles and implemented two independent methods, FASSM and HMMER, for identifying family members. HMMER employs a global sequence similarity search, while FASSM relies on motif identification and matching. This improved and updated database, 3PFDB+ generated in this study, provides representative sequences and profiles for PFAM families, with 13 519 family representatives having more than 90% family coverage. The representative sequence is also highlighted in a two-dimensional plot, which reflects the relative divergence between family members. Representatives belonging to small families with short sequences are mainly associated with low coverage. The set of sequences not recognized by the family representative profiles highlights several potential false or weak family associations in PFAM. Partial domains and fragments dominate such cases, along with sequences that are highly diverged or different from other family members. Some of these outliers were also predicted to have different secondary structure contents, which reflect different putative structure or functional roles for these domain sequences. Database URL: http://caps.ncbs.res.in/3pfdbplus/.
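
The selection criterion described in the abstract, choosing the family member whose profile recovers the largest share of its family, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the toy sequence IDs, and the `hits` mapping (candidate ID to the set of member IDs its profile recognizes, as a search tool such as HMMER or FASSM might report) are all assumptions introduced here.

```python
def family_coverage(recognized, members):
    """Percentage of family members recognized by a candidate's profile."""
    if not members:
        raise ValueError("family has no members")
    return 100.0 * len(recognized & members) / len(members)


def best_representative(hits, members):
    """Return (candidate_id, coverage) for the candidate with the highest coverage.

    Ties are broken by candidate ID so the result is deterministic;
    that tie-break is an arbitrary choice made for this sketch.
    """
    if not hits:
        raise ValueError("no candidates supplied")
    best_id = min(hits, key=lambda c: (-family_coverage(hits[c], members), c))
    return best_id, family_coverage(hits[best_id], members)


# Hypothetical toy family of four sequences (IDs are made up).
members = {"s1", "s2", "s3", "s4"}
hits = {
    "s1": {"s1", "s2"},              # profile built from s1 finds 2 of 4 members
    "s2": {"s1", "s2", "s3", "s4"},  # profile built from s2 finds all 4
    "s3": {"s3"},                    # profile built from s3 finds only itself
}
representative, coverage = best_representative(hits, members)
```

In this toy family, `s2` is selected because its profile recognizes every member. A real run would replace `hits` with parsed search results.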

Coverage obtained with best representatives identified in 3PFDB+. (A) Comparison of family coverage of best representatives identified by HMMER (3) and FASSM (8). For the 153 families where the representatives by both methods had coverage <90%, (B) the distribution of average family sequence length and (C) family size, are plotted.
© Copyright Policy - creative-commons


Mentions: The best representative sequences (BRSs) identified by HMMER and FASSM were compared for their family coverage. With HMMER, 13 214 representatives retain family coverage of more than 90%, while 12 122 representatives identified using FASSM exhibit coverage of more than 90% (Figure 3A). For 3473 families, the same BRS was identified by both HMMER and FASSM. HMMER-identified representatives achieve better coverage than FASSM-identified ones for 33.5% of families, whereas FASSM-based representatives retain better coverage in only 16.6% of cases.
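
The per-family comparison summarized above (representatives above the 90% threshold for each method, families where both fall below it, and the share of families where one method's representative gives better coverage) can be tallied as in the sketch below. The function name, the dictionary layout, and the four toy families with their coverage values are assumptions made for illustration; they are not data from the study.

```python
def compare_methods(hmmer_cov, fassm_cov, threshold=90.0):
    """Summarize per-family coverage (in percent) for two sets of representatives.

    Only families present in both mappings are compared.
    """
    families = hmmer_cov.keys() & fassm_cov.keys()
    n = len(families)
    if n == 0:
        raise ValueError("no families in common")
    hmmer_better = sum(hmmer_cov[f] > fassm_cov[f] for f in families)
    fassm_better = sum(fassm_cov[f] > hmmer_cov[f] for f in families)
    return {
        # Families whose representative exceeds the coverage threshold.
        "hmmer_above": sum(hmmer_cov[f] > threshold for f in families),
        "fassm_above": sum(fassm_cov[f] > threshold for f in families),
        # Families where both representatives fall below the threshold.
        "both_below": sum(
            hmmer_cov[f] < threshold and fassm_cov[f] < threshold for f in families
        ),
        # Share of families (in percent) where one method is strictly better.
        "hmmer_better_pct": 100.0 * hmmer_better / n,
        "fassm_better_pct": 100.0 * fassm_better / n,
    }


# Hypothetical coverage values for four made-up families.
hmmer = {"FAM_A": 95.0, "FAM_B": 80.0, "FAM_C": 100.0, "FAM_D": 60.0}
fassm = {"FAM_A": 92.0, "FAM_B": 85.0, "FAM_C": 100.0, "FAM_D": 40.0}
summary = compare_methods(hmmer, fassm)
```

Families with identical coverage under both methods, such as `FAM_C` here, count toward neither "better" percentage, so the two percentages need not sum to 100.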

