Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC - PLoS ONE (2009)

Bottom Line: In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.


Affiliation: Graduate Program in Biological and Medical Informatics, University of California San Francisco, San Francisco, California, United States of America.

ABSTRACT
The dramatic increase in heterogeneous types of biological data--in particular, the abundance of new protein sequences--requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity--GPCRs and kinases from humans, and the crotonase superfamily of enzymes--we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.


Figure 1 (pone-0004345-g001): Sequence similarity network topology changes in a predictable way with the stringency of the threshold. A. Thresholded sequence similarity networks represent sequences as nodes (circles) and all pairwise sequence relationships (alignments) better than a threshold as edges (lines). The same network, depicting three simulated protein classes, is shown here at four different thresholds. At stringent thresholds, the sequences break up into disconnected groups; within each group the sequences are highly similar. The relative positioning of disconnected groups has no meaning, while the lengths of connecting edges tend to correlate with the relative dissimilarities of each pair of sequences. As the threshold is relaxed and edges associated with less significant relationships are added to the network, groups merge together and eventually become completely interconnected. B. Simulated dendrogram for a sequence set that might give rise to the network in A.
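
The thresholding behaviour described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sequence names, the pairwise scores, and the convention that a higher score means a better relationship are all assumptions made for illustration, as are the helper names `threshold_edges` and `connected_groups`.

```python
def threshold_edges(scores, threshold):
    """Keep only the pairs whose (hypothetical) similarity score meets the threshold."""
    return [pair for pair, score in scores.items() if score >= threshold]


def connected_groups(nodes, edges):
    """Return the connected groups of the network, found with union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps lookups short
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Made-up data: six sequences in three simulated classes (a, b, c).
nodes = ["a1", "a2", "b1", "b2", "c1", "c2"]
scores = {
    ("a1", "a2"): 90, ("b1", "b2"): 85, ("c1", "c2"): 80,  # within-class pairs
    ("a1", "b1"): 40, ("a2", "b2"): 35,                    # classes a and b are closer
    ("b1", "c1"): 15, ("a1", "c2"): 10,                    # class c is more distant
}

# Count the disconnected groups at four thresholds, from stringent to relaxed.
group_counts = {
    t: len(connected_groups(nodes, threshold_edges(scores, t)))
    for t in (95, 50, 30, 5)
}
print(group_counts)
```

With these made-up scores, the most stringent threshold leaves every sequence isolated, an intermediate one recovers the three classes, and relaxing it further merges the groups until the network is fully connected.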

Over the past two decades, there has been a disorderly explosion of biological data, with volume increasing exponentially over time. To keep pace with the broad classes of new sequence, structural, and functional data arising from compilations of genomic and proteomic data in particular, many powerful approaches have been developed for unearthing meaningful themes and hypotheses from within the jumble. Yet there is still a critical need for improved techniques enabling fast and comprehensive analysis of large sequence data sets, especially to access the biologically useful context that can be extracted from this information. There is a particular demand for easy-to-use techniques to aid experimental biologists in finding useful starting points for analyzing diverse superfamilies of proteins. Here we address one of these techniques, sequence similarity networks (Fig. 1). A relatively new application of methods commonly used to summarize protein-protein interactions on a large scale [1], sequence similarity networks (here, networks in which the interrelationships between proteins are described as a collection of independent pairwise alignments between sequences) represent an attractive adjunct approach to multiple sequence alignments and phylogenetic trees. Moreover, they offer several important capabilities unavailable to these methods. First, they provide a fast, easy-to-compute framework for observing relationships among very large sets of evolutionarily related proteins; more importantly, when visualized they also allow the perception of trends in orthogonal information (viz., function-related information) mapped onto the context of sequence similarity. Because they provide access to these relationships in an intuitively accessible manner and are easy to create and manipulate, these networks fill a need that is not currently well addressed by other tools.
By enabling the visualization of extremely large sets of related sequences, networks provide advantages that phylogenetic trees do not, particularly in showing all relationships that score above a user-defined similarity cut-off rather than only the small number of optimally scoring connections. Also, for the same amount of computation, a much larger set of sequences can be analyzed using a network than could be used to infer a tree. Furthermore, there are restrictions on the number of sequences that can be usefully considered in generating a multiple sequence alignment, in part due to the practical limitations of viewing alignments of hundreds of sequences. The corresponding benefit of visualizing a sequence similarity network, rather than analyzing it numerically, is that the displayed network can be overlaid with as many types of derived and orthogonal information as spring to mind. The network can then be interactively explored to see how these different features coalesce into trends (or don't) when viewed in the context of sequence similarity. Additionally, using interactive software to visualize the networks (e.g. [1]) and to link to other types of information such as three-dimensional structures (e.g. [2]) allows the evaluation of individual edges and sets of edges, enabling an informed researcher to decide how much to trust the relationships implied by the network structure.
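
The idea of overlaying orthogonal information on a network can also be sketched numerically, although the excerpt stresses interactive, visual exploration. The Python sketch below is an assumption-laden illustration, not the authors' method. The node names, edges, class labels, and the helper names `components` and `overlay_outliers` are all hypothetical. It maps an annotation onto each connected group and flags members whose label differs from the group's majority.

```python
from collections import Counter, deque


def components(nodes, edges):
    """Find connected groups of the network with a breadth-first search."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:
            n = queue.popleft()
            group.append(n)
            for m in sorted(adjacency[n]):
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
        result.append(sorted(group))
    return result


def overlay_outliers(groups, annotation):
    """For each group, report the majority label and the members that disagree."""
    report = []
    for group in groups:
        counts = Counter(annotation[n] for n in group)
        majority, _ = counts.most_common(1)[0]
        outliers = [n for n in group if annotation[n] != majority]
        report.append((majority, outliers))
    return report


# Made-up network: edges are pairs that passed some similarity cut-off.
nodes = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
edges = [("p1", "p2"), ("p2", "p3"), ("p1", "p3"), ("p3", "p4"),
         ("p5", "p6"), ("p6", "p7")]

# Made-up orthogonal annotation (for example, a functional class per protein).
annotation = {
    "p1": "class_A", "p2": "class_A", "p3": "class_A", "p4": "class_B",
    "p5": "class_C", "p6": "class_C", "p7": "class_C",
}

report = overlay_outliers(components(nodes, edges), annotation)
print(report)
```

In this toy example, `p4` clusters with class_A sequences but carries a different label. That is the kind of outlier an overlay is meant to surface for closer inspection of the underlying edges.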

