PairsDB atlas of protein sequence space.

Heger A, Korpelainen E, Hupponen T, Mattila K, Ollikainen V, Holm L - Nucleic Acids Res. (2007)

Bottom Line: Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. The data is stored in a MySQL relational database. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.

Affiliation: MRC Functional Genetics Unit, University of Oxford, UK.

ABSTRACT
Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. Filtering options allow you to find a set of sequences satisfying a set of criteria, for example, all human proteins with solved structure and without transmembrane segments. PairsDB is continually updated and covers all sequences in UniProt. The data is stored in a MySQL relational database. Data files will be made available for download at ftp://nic.funet.fi/pub/sci/molbio. PairsDB can also be accessed interactively at http://pairsdb.csc.fi. PairsDB data is a valuable platform for building various downstream automated analysis pipelines. For example, the graph of all-against-all similarity relationships is the starting point for clustering protein families, delineating domains, improving alignment accuracy by consistency measures, and defining orthologous genes. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.

Figure 1: (a) Data storage model and (b) Functionality of the PairsDB web server.

Mentions: We compute the all-against-all alignment data in representative subsets of the whole sequence dataset. Sequences not in the representative subset are aligned only to the representatives (Figure 1a). All sequences are more than 90% identical to their representatives in the case of a BLAST search, and more than 40% identical in the case of a PSI-BLAST search. Using representative subsets saves search time and storage space quadratically and without loss of information (11). As NRDBxx clusters have a skewed size distribution, our scheme leads to huge savings in the largest protein families. The theoretical, and in our experience reasonable, assumption here is that alignments to a third party by sequences which are more than 90% or more than 40% identical to each other will be consistent. The alignment between two proteins of interest can be reconstructed on the fly by transitive alignment using their respective representatives as intermediates.
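
The transitive reconstruction described above can be illustrated with a minimal Python sketch. It is not the PairsDB implementation: it assumes each stored alignment is reduced to a mapping from residue positions in one sequence to residue positions in the other, and the names Alignment, compose, invert and transitive_alignment are hypothetical, chosen only for this example.

```python
from typing import Dict

# Assumption for illustration: an alignment is a dict mapping a residue
# index in the first sequence to the aligned residue index in the second.
Alignment = Dict[int, int]


def compose(ab: Alignment, bc: Alignment) -> Alignment:
    """Chain A->B with B->C; positions of B missing from bc are dropped."""
    return {a: bc[b] for a, b in ab.items() if b in bc}


def invert(ab: Alignment) -> Alignment:
    """Turn an A->B residue mapping into the B->A mapping."""
    return {b: a for a, b in ab.items()}


def transitive_alignment(
    query_to_rep_q: Alignment,
    rep_q_to_rep_t: Alignment,
    target_to_rep_t: Alignment,
) -> Alignment:
    """Reconstruct query->target via the two representatives.

    Path: query -> rep(query) -> rep(target) -> target.
    """
    query_to_rep_t = compose(query_to_rep_q, rep_q_to_rep_t)
    return compose(query_to_rep_t, invert(target_to_rep_t))


# Toy data (made up): three stored alignments and the reconstructed one.
query_to_rep_q = {0: 0, 1: 1, 2: 2, 3: 4}
rep_q_to_rep_t = {0: 10, 1: 11, 2: 12, 3: 13, 4: 14}
target_to_rep_t = {5: 10, 6: 11, 7: 13, 8: 14}

result = transitive_alignment(query_to_rep_q, rep_q_to_rep_t, target_to_rep_t)
print(result)
```

A residue pair survives only if it is aligned at every step of the chain, so the reconstructed alignment can be shorter than any of the stored ones. In the toy data, query position 2 maps to position 12 of the target's representative, which the target does not cover, so it is absent from the result.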

