Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC - PLoS ONE (2009)

Bottom Line: In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.


Affiliation: Graduate Program in Biological and Medical Informatics, University of California San Francisco, San Francisco, California, United States of America.

ABSTRACT
The dramatic increase in heterogeneous types of biological data--in particular, the abundance of new protein sequences--requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity--GPCRs and kinases from humans, and the crotonase superfamily of enzymes--we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.


Figure 5. Crotonase superfamily: sequence similarity network from full-length sequences and from just the domain in common. The displayed networks all describe the pairwise relationships between 1,170 sequences from the crotonase superfamily. A. Network colored by family annotation, involving full-length sequences, thresholded at an E-value of 1×10^-30. The worst edges displayed correspond to a median of 33% identity over alignments of 250 residues. B. The full-length network from A with nodes colored by sequence length and edges colored by alignment length. The same bifunctional enoyl-CoA hydratases (bECH) are marked with a dashed oval in B and C. C. Network colored by family annotation, involving just the crotonase domain, thresholded at 1×10^-29. The worst edges displayed correspond to a median of 38% identity over alignments of 180 residues. D. 17 selected edges from the network in A and B. In the left panel, for each pair of sequences participating in an alignment, the log E-value versus the HMM used to define the crotonase domain is shown for each sequence, with the single-domain ECH (sECH) on the bottom and the second member of the pair on the top, and the log BLAST E-value for the alignment between the two is shown in the middle. Two example bECH and sECH sequences (not alignments) are shown at the bottom of the left and middle panels. In the middle panel, each amino acid in each sequence from the 17 alignments is colored according to whether it was aligned to the crotonase domain defined by the HMM, and/or was paired to the other sequence in the BLAST alignment used to define an edge. Locations of six of these edges are marked in the enlarged view of the network in A in the right panel. The locations of the example bECH and sECH sequences are marked in the right panel using stars. See Tables S1 and S3 for quantitative comparisons.
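The thresholding described in the caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the hit tuples are invented toy data, the function names `threshold_edges` and `worst_edge_summary` are hypothetical, and defining the "worst" edges as those within a factor of 10 of the cutoff is an assumption made only for this example.

```python
from statistics import median

# Toy pairwise BLAST hits (invented for illustration):
# (query, subject, e_value, percent_identity, alignment_length)
hits = [
    ("seqA", "seqB", 1e-45, 52.0, 260),
    ("seqB", "seqA", 1e-44, 52.0, 260),  # reciprocal hit for the same pair
    ("seqA", "seqC", 3e-31, 34.0, 240),
    ("seqB", "seqC", 5e-31, 30.0, 250),
    ("seqC", "seqD", 1e-12, 25.0, 120),  # too weak for a 1e-30 cutoff
]

def threshold_edges(hits, max_evalue):
    """Keep one edge per unordered pair (its lowest E-value) if it passes the cutoff."""
    best = {}
    for query, subject, evalue, identity, length in hits:
        if query == subject:
            continue  # ignore self-hits
        pair = tuple(sorted((query, subject)))
        if pair not in best or evalue < best[pair][0]:
            best[pair] = (evalue, identity, length)
    return {pair: v for pair, v in best.items() if v[0] <= max_evalue}

def worst_edge_summary(edges, max_evalue, window=10.0):
    """Median identity and alignment length of edges near the cutoff.

    Assumption: "worst" means an E-value within a factor of `window` of the cutoff.
    """
    weakest = [v for v in edges.values() if v[0] >= max_evalue / window]
    if not weakest:
        return None
    return (median(v[1] for v in weakest), median(v[2] for v in weakest))

edges = threshold_edges(hits, 1e-30)
summary = worst_edge_summary(edges, 1e-30)
print(sorted(edges))
print(summary)
```

Raising or lowering `max_evalue` in this sketch adds or removes edges, which is how a stricter threshold separates a network into smaller clusters.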

In the course of considering the effect on network topology of using full-length sequences or only single domains, new groupings for the enoyl-CoA hydratase family were revealed, based on changes in domain architecture. (The enoyl-CoA hydratase family (ECH) is the constituent family for which the larger ECH superfamily was named.) Most proteins are composed of two or more domains, and the combination of multiple domains may modify the function of a multidomain protein relative to its single-domain homologue[33]. A comparison of full-length and single-domain sequences is especially relevant for highly divergent superfamilies in which domain organization may vary across different subgroups and influence network topology. Using the ECH superfamily (also known as the crotonase superfamily)[34] for these tests, we found that there can be a large degree of qualitative and quantitative correspondence between full-length sequence networks and domain-only networks when BLAST is used as a similarity metric, because BLAST calculates local alignments. Since the edges represent the local regions of similarity between sequences, as demonstrated by the agreement between Fig. 5A and 5C, the topology does not change dramatically whether the domain in common is isolated or is embedded in a larger, multi-domain sequence (the correlation in displayed distances between the two networks is 0.868; see Table S3 for full statistics). However, in the domain-only network, the alignments leading to a similar topology are shorter and have higher sequence similarity, leading to differences in the associated E-values. Importantly, the sequence region defined as the crotonase domain by a hidden Markov model (HMM) and the region covered by the BLAST alignment largely overlap; see Fig. 5D for an illustration of selected alignments used to define edges from the full-length crotonase network. At significant BLAST alignment E-values (in particular, within the range included in the network in Fig. 5A, 5B, and 5D), the BLAST alignments tend to extend slightly beyond the crotonase domain. For the crotonase and GPCR superfamilies, the families of network topologies across a range of different E-value thresholds do not change substantially whether or not the domain-only sequence is used to construct a network with BLAST (data not shown).
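One way to quantify agreement between two network layouts, such as a full-length and a domain-only network, is to correlate the distances between the same node pairs in each layout. The Python sketch below assumes a Pearson correlation over Euclidean layout distances; the source does not specify the exact statistic here (it points to Table S3), so this is an illustrative choice. The coordinates and the names `layout_full`, `layout_domain`, and `layout_distance_correlation` are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical 2D node coordinates from two network layouts (toy data).
layout_full = {"seqA": (0.0, 0.0), "seqB": (1.0, 0.0), "seqC": (0.0, 2.0), "seqD": (3.0, 3.0)}
# Second layout: same arrangement at twice the scale, plus one node absent from the first.
layout_domain = {"seqA": (0.0, 0.0), "seqB": (2.0, 0.0), "seqC": (0.0, 4.0),
                 "seqD": (6.0, 6.0), "seqE": (9.0, 9.0)}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def layout_distance_correlation(layout_a, layout_b):
    """Correlate pairwise node distances over the nodes present in both layouts."""
    shared = sorted(set(layout_a) & set(layout_b))
    dists_a, dists_b = [], []
    for u, v in combinations(shared, 2):
        dists_a.append(math.dist(layout_a[u], layout_a[v]))
        dists_b.append(math.dist(layout_b[u], layout_b[v]))
    return pearson(dists_a, dists_b)

r = layout_distance_correlation(layout_full, layout_domain)
print(f"correlation of displayed distances: {r:.3f}")
```

Because the toy second layout is a uniformly scaled copy of the first, the correlation here is essentially 1; real layouts that differ in detail would give a lower value.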

