Subfamily specific conservation profiles for proteins based on n-gram patterns.

Vries JK, Liu X - BMC Bioinformatics (2008)

Bottom Line: The speed of the new algorithm was comparable to the multiple alignment approach. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.


Affiliation: Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA. vries@ccbb.pitt.edu

ABSTRACT

Background: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}), which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank-ordered by similarity with respect to the query.
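
To make the pattern definition concrete, the following sketch enumerates NP{n,m} patterns from a sequence. It is a minimal illustration only: the function name, the use of '.' as the wildcard character, and the assumption that wildcards may occupy any of the n+m window positions are illustrative choices, not taken from the paper.

    from itertools import combinations

    def np_patterns(sequence, n=4, m=2):
        # Enumerate NP{n,m} patterns: n residues and m wildcards ('.')
        # in every sliding window of size n + m.
        w = n + m
        patterns = set()
        for start in range(len(sequence) - w + 1):
            window = sequence[start:start + w]
            # Choose which m of the w positions become wildcards.
            for wild in combinations(range(w), m):
                patterns.add("".join("." if i in wild else window[i]
                                     for i in range(w)))
        return patterns

    # Example: NP{4,2} patterns from a short fragment of a query sequence.
    print(sorted(np_patterns("MKTAYIAKQR"))[:5])

Counting how often each such pattern occurs in target sequences similar to the query (the signal) versus over all target sequences (the noise) provides the input to the singular value decomposition step described above.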

Results: The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two profiles were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to the multiple alignment approach.

Conclusion: Subfamily specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.

Figure 4: Comparison of the invariant conservation profile (ICP) for P00918 generated from the reduced SPT data set (~40 K sequences) with the ICP generated from the full SPT data set (~2.1 M sequences). The reduced SPT set was created from an inverted index of trigrams. The rmsd difference between the traces is only 0.042. This was typical for the 4,248 traces in the SPT data set.

Mentions: Comparison of the NP{4,2} patterns in a query sequence with all sequences in the SPT data set (~2.1 M) is computationally costly. Profile generation on a single processor requires 45–60 minutes. Since the average Pfam-A family has only 50–100 members, most of the processed sequences represent noise. Because noise is distributed randomly, its variance should not change as long as the sample is large enough from a statistical point of view. If most of the sequences representing noise could be eliminated up front, the computational time would decrease significantly without affecting the variance associated with the signal eigenvector (first eigenvector). Using the multiple alignments for the 8,183 families in Pfam-A, we determined the probability of finding at least one n-gram (contiguous run of n residues) common to all family members for 1 ≤ n ≤ 4. It should be noted that this process was not related to NP{4,2} patterns. The objective was to lower the computational burden by eliminating up front a significant percentage of the sequences that had no chance of containing family members. The idea was to find an n-gram present in all family members and to eliminate processing of all sequences that did not contain that n-gram. The size of the n-gram was critical: smaller n-grams are more likely to be found in all family members, but larger n-grams are less likely to be found by random chance. The best balance was achieved with trigrams (n = 3). Approximately 96% of all family members over the 8,183 Pfam-A families had at least one trigram in common, while the expectation of finding a trigram match by random chance in the SPT data set was approximately 275. When n was smaller than 3, the expectation of finding a match in the SPT data set by random chance was too high to be useful; when n was larger than 3, the coverage of family members was too low. An inverted index was created mapping the trigrams in the SPT data set to integers representing UniProt accession numbers. A preprocessor for queries was constructed that looked up each trigram in the query sequence while incrementing the position corresponding to its UniProt accession number in a collection buffer. Sorting this buffer in descending order of trigram hits and taking the first 40,000 members (top 2%) captured the majority of the family while eliminating 98% of the noise. This decreased the processing time by two orders of magnitude; computational times of approximately 1 minute were achieved on conventional CPUs. Comparison of profiles generated with the complete and reduced SPT data sets over a selection of 4,248 queries showed an average rmsd difference of less than 5%. This is illustrated for the P00918 profile in Figure 4; the example shown is typical of all the ICPs generated from the SPT data set. The reduced SPT data set was used in the studies comparing ICP conservation profiles to profiles generated from MSAs. The time complexity of the NPLA algorithm is comparable to that of standard sequence search algorithms such as PSI-BLAST. The time complexity of running PSI-BLAST on the SPT database is O(nl), where n is the number of sequences in the SPT database and l is the length of the longest sequence [21,22]. The time complexity of running the NPLA algorithm is O(n'l), where n' is the number of sequences in the top 2% of the SPT database.
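
The trigram prefilter can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions: the SPT data set is held as a plain Python list of sequences, integer list indices stand in for UniProt accession numbers, and the inverted index is an in-memory dictionary; all names are illustrative rather than taken from the authors' implementation.

    from collections import defaultdict

    def build_trigram_index(spt_sequences):
        # Map each trigram to the set of sequence indices (standing in for
        # UniProt accession numbers) of the SPT sequences that contain it.
        index = defaultdict(set)
        for seq_id, sequence in enumerate(spt_sequences):
            for i in range(len(sequence) - 2):
                index[sequence[i:i + 3]].add(seq_id)
        return index

    def reduce_spt(query, index, n_sequences, top=40000):
        # Count trigram hits per SPT sequence for the query, then keep the
        # highest-scoring sequences (the reduced SPT data set, roughly the
        # top 2%).
        hits = [0] * n_sequences
        for i in range(len(query) - 2):
            for seq_id in index.get(query[i:i + 3], ()):
                hits[seq_id] += 1
        ranked = sorted(range(n_sequences), key=lambda s: hits[s], reverse=True)
        return ranked[:top]

Profile generation with NP{4,2} patterns would then run only over the sequences returned by the prefilter, which is what reduces the per-query time from 45–60 minutes to roughly one minute and gives the O(n'l) complexity quoted above.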