Subfamily specific conservation profiles for proteins based on n-gram patterns.

Vries JK, Liu X - BMC Bioinformatics (2008)

Bottom Line: The speed of the new algorithm was comparable to the multiple alignment approach. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.

Affiliation: Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA. vries@ccbb.pitt.edu

ABSTRACT

Background: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank ordered by similarity with respect to the query.
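As an illustration of the NP{n,m} definition above, the following Python sketch enumerates n-gram patterns by sliding a window of size n+m over a sequence and masking m of its positions with a wildcard. The function names, the `.` wildcard symbol, and the choice to allow wildcards at any position in the window (including the ends) are assumptions made for this example; the abstract does not specify these details, and this is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "."  # assumed wildcard symbol, chosen for this sketch only


def ngram_patterns(seq, n, m):
    """Yield (start, pattern) for every window of size n+m in seq,
    with every possible choice of m wildcard positions in the window."""
    width = n + m
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for masked in combinations(range(width), m):
            masked = set(masked)
            pattern = "".join(
                WILDCARD if i in masked else residue
                for i, residue in enumerate(window)
            )
            yield start, pattern


def count_patterns(sequences, n, m):
    """Count every occurrence of each pattern over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(pattern for _, pattern in ngram_patterns(seq, n, m))
    return counts


if __name__ == "__main__":
    # Two toy sequences; real input would be protein sequences.
    print(count_patterns(["ACD", "ACE"], n=2, m=1))
```

Counting such patterns over the targets most similar to the query, and again over all targets, would give the "signal" and "noise" counts described above; the singular value decomposition step is not sketched here because the abstract does not give enough detail to reproduce it.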

Results: The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two profiles were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to the multiple alignment approach.

Conclusion: Subfamily specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.

Figure 5: Best rmsd fits for the four family size ranges. All rmsds are two standard deviations or better below the mean. For each family range, the best fit is shown on the left and the worst fit is shown on the right. (a,b) = S1000, (c,d) = S500, (e,f) = S100 and (g,h) = S30. The major peaks in the ICP traces are very close to the peaks in the MACP traces in all cases. This shows that a significant number of examples exist (N = 103) where the ICP and the MACP are the same.

Mentions: The comparison between ICP and MACP for the four samples as a function of level in the cluster hierarchy is shown in Table 1. At level 1 the multiple alignment contains all the Pfam-A family members. As the level in the hierarchy increases, the multiple alignment samples get smaller and the pairwise difference in evolutionary distance between members decreases. The N30–N1000 columns show the number of clusters at each level that contained at least 20 members. The RMSD30–RMSD1000 columns show the average rmsd difference between the ICP and MACP traces for the best rmsd fits at each level. A strong correlation between hierarchical level (i.e., pairwise consistency) and rmsd difference is apparent. The ICP and the MACP approach one another as the pairwise difference between the cluster members decreases. If the query sequence were a member of a distinct subfamily and the multiple alignment were limited to members of that subfamily, this would be the expected behavior. This table also implies that a subset of MACP traces should exist that are the same as their corresponding ICP traces. This should happen whenever the multiple alignment used to generate the MACP is dominated by a single subfamily and the query sequence used to generate the ICP is a member of that subfamily. Figure 5 shows that a significant number of examples exist in all four test samples where the ICP and MACP traces have the same peak and amplitude patterns. The rmsd difference between the ICP and the MACP was determined for each member in each of the four samples (N = 30, N = 100, N = 500, N = 1000). A list rank-ordered by ascending rmsd difference was created for each sample. For each sample, representatives that were more than 2 standard deviations below the mean were selected for plotting. The left-sided traces show the best fit for each sample. The right-sided traces show the fit at the 2 standard deviation limit. The total number of traces more than 2 standard deviations below the mean for all 4 samples was 103.
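The selection procedure described above (compute the rmsd between each ICP and MACP trace, rank by ascending rmsd, keep those more than 2 standard deviations below the mean) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: both traces are taken to be equal-length lists of per-position conservation values, the population standard deviation is used, and the function names are hypothetical. The text does not say which standard deviation estimator the authors used.

```python
import math
from statistics import mean, pstdev


def rmsd(trace_a, trace_b):
    """Root-mean-square difference between two equal-length traces."""
    if len(trace_a) != len(trace_b):
        raise ValueError("traces must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(trace_a, trace_b))
    return math.sqrt(squared / len(trace_a))


def select_best_fits(rmsds, n_sd=2.0):
    """Return (rmsd, index) pairs, in ascending rmsd order, for entries
    lying more than n_sd standard deviations below the mean rmsd."""
    cutoff = mean(rmsds) - n_sd * pstdev(rmsds)  # population SD is an assumption
    return sorted((r, i) for i, r in enumerate(rmsds) if r < cutoff)


if __name__ == "__main__":
    # Toy values only; these are not data from the paper.
    sample = [10, 10, 10, 10, 10, 10, 10, 10, 10, 1]
    print(select_best_fits(sample))
```

In this sketch the first element of the returned list corresponds to the best fit of a sample, and the last element to the fit nearest the 2 standard deviation limit.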
