OD-seq: outlier detection in multiple sequence alignments.

Jehl P, Sievers F, Higgins DG - BMC Bioinformatics (2015)

Bottom Line: Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. OD-seq can also detect outliers in sets of unaligned sequences, but with reduced accuracy.


Affiliation: UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin, 4, Ireland. peter.jehl@ucdconnect.ie.

ABSTRACT

Background: Multiple sequence alignments (MSAs) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers, with accompanying software called OD-seq. The method finds sequences whose average distance to the rest of the sequences in a dataset is anomalous.

Results: The software can take an MSA, a distance matrix or a set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to O(N log(N)), which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using an MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity.

Conclusion: OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium-sized alignments of a few thousand sequences, it can detect outliers in a few seconds. The software is available at http://www.bioinf.ucd.ie/download/od-seq.tar.gz.
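The interquartile-range step described in the Results can be illustrated with a minimal Python sketch. This is not the OD-seq implementation: the distance function (fraction of mismatching alignment columns), the upper cutoff of Q3 + 1.5 x IQR and all function names are assumptions chosen for illustration, and the sketch uses the full O(N^2) pairwise comparison rather than mBed.

```python
from statistics import quantiles


def mismatch_fraction(a, b):
    """Fraction of alignment columns in which two aligned sequences differ.

    Illustrative metric only; it is not one of the OD-seq metrics.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_distances(seqs):
    """Average distance of each sequence to all others (full O(N^2) mode)."""
    n = len(seqs)
    totals = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = mismatch_fraction(seqs[i], seqs[j])
            totals[i] += d
            totals[j] += d
    return [t / (n - 1) for t in totals]


def iqr_outliers(values, k=1.5):
    """Indices whose value exceeds Q3 + k * IQR (k = 1.5 is an assumption)."""
    q1, _, q3 = quantiles(values, n=4)
    cutoff = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v > cutoff]


if __name__ == "__main__":
    # Toy alignment: five near-identical sequences plus one unrelated sequence.
    alignment = [
        "ACDEFGHIKL",
        "ACDEFGHIKM",
        "ACDEFGHIRL",
        "ACDQFGHIKL",
        "ACDEFGHLKL",
        "WWWWWWWWWW",
    ]
    print(iqr_outliers(average_distances(alignment)))
```

Only the upper tail is tested here because an outlier is a sequence that is unusually far from the rest; a bootstrap-based cutoff, as mentioned in the abstract, could replace `iqr_outliers` without changing the rest of the sketch.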


Fig. 4: Computing times. Plot of compute times for varying sequence length and number of sequences in mBed and full distance mode for the 3 different metrics. Execution times are measured in seconds and shown on the y-axis. In (a), the datasets contain 5000 sequences with length varying from 10 to 2000 amino acids (x-axis). The datasets used in (b) consist of 10 to 50,000 sequences (x-axis) of length 200.

Mentions: The run times for the algorithm are expected to show a minimum complexity of O(L) in sequence length L. Times for different subsets of the ABC transporters of different lengths are plotted in Fig. 4a, and all variations of the algorithm show approximately linear behaviour. Time complexity with respect to the number of sequences is expected to be O(N^2) for full distance matrix mode or O(N log(N)) for mBed mode. Run times for different subsets of the ABC transporters are plotted in Fig. 4b. The data shown are the execution times for the whole analysis, including sequence read, distance computation, normalisation and output. Outlier detection for the whole ABC transporter family from Pfam, with 363,409 sequences and an alignment length of 2177, takes 39 minutes on one CPU core.
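To give a feel for the gap between the two modes, the following Python sketch compares the number of pairwise distances in a full matrix, N(N-1)/2, with N log2(N) as a rough stand-in for the mBed cost. The log base and the absence of constant factors are assumptions; the figures are operation counts, not predicted run times.

```python
import math


def full_matrix_pairs(n):
    """Number of distinct sequence pairs in a full distance matrix."""
    return n * (n - 1) // 2


def nlogn_estimate(n):
    """N * log2(N): a rough proxy for O(N log(N)) work (constants ignored)."""
    return n * math.log2(n)


if __name__ == "__main__":
    # 50,000 is the largest N in Fig. 4b; 363,409 is the full ABC transporter family.
    for n in (1_000, 50_000, 363_409):
        pairs = full_matrix_pairs(n)
        proxy = nlogn_estimate(n)
        print(f"N={n}: pairs={pairs}, N*log2(N)={proxy:.0f}, ratio={pairs / proxy:.0f}")
```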

