misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads.

Zhu X, Leung HC, Wang R, Chin FY, Yiu SM, Quan G, Li Y, Zhang R, Jiang Q, Liu B, Dong Y, Zhou G, Wang Y - BMC Bioinformatics (2015)

Bottom Line: We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct them at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Assembly errors, and correct assemblies that correspond to structural variations, can be detected by comparing the reference genome with the assembled sequence.


Affiliation: College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China. zhuxiao.hit@gmail.com.

ABSTRACT

Background: Because of the short read length of high-throughput sequencing data, errors are introduced during genome assembly, which may adversely affect downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and detecting inconsistent features along mis-assembled sequences. However, the former approach cannot distinguish assembly errors from real structural variations between the target genome and the reference genome, while the latter approach can produce many false positives (correctly assembled sequences reported as mis-assembled).

Results: We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct them at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Assembly errors, and correct assemblies that correspond to structural variations, can be detected by comparing the reference genome with the assembled sequence. Different types of assembly errors can then be distinguished within the mis-assembled sequence by analyzing the aligned paired-end reads, using multiple features derived from coverage and consistency of insert distance to obtain high-confidence error calls.
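To make the two feature families named above (coverage and consistency of insert distance) concrete, the following is a minimal Python sketch and not misFinder's actual implementation. The function names, the z-score test, and the thresholds `min_cov_ratio` and `max_abs_z` are all assumptions introduced here for illustration.

```python
from statistics import mean


def insert_size_z(observed_inserts, lib_mean, lib_sd):
    """Z-score of the mean observed insert size against the library mean.

    Hypothetical helper: assumes insert sizes of read pairs spanning a
    candidate region and a known library mean / standard deviation.
    """
    n = len(observed_inserts)
    if n == 0:
        return None  # no spanning pairs, so the insert distance cannot be tested
    standard_error = lib_sd / n ** 0.5
    return (mean(observed_inserts) - lib_mean) / standard_error


def region_features(observed_inserts, region_cov, mean_cov, lib_mean, lib_sd):
    """Collect a coverage ratio and an insert-distance z-score for one region."""
    cov_ratio = region_cov / mean_cov if mean_cov else 0.0
    return {
        "cov_ratio": cov_ratio,
        "insert_z": insert_size_z(observed_inserts, lib_mean, lib_sd),
    }


def looks_misassembled(features, min_cov_ratio=0.5, max_abs_z=3.0):
    """Flag a region whose coverage is low or whose insert distances are inconsistent.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    z = features["insert_z"]
    if z is None:
        return True  # nothing spans the region, which is itself suspicious
    return features["cov_ratio"] < min_cov_ratio or abs(z) > max_abs_z
```

A region spanned by pairs whose inserts are stretched well beyond the library mean would be flagged, while a region with normal coverage and library-like inserts would not. A real tool would combine more features than these two.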

Conclusions: We tested the performance of misFinder on both simulated and real paired-end read data, and misFinder gave accurate error calls with very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed both by 1) identifying more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguishing correct assemblies that correspond to structural variations from mis-assembled sequence. misFinder can be freely downloaded from https://github.com/hitbio/misFinder .



Fig. 5: Visualization of misFinder output for identifying assembly errors on E. coli simulated data. The results of running misFinder are shown using Circos [30]. The ideogram (green) shows the selected scaffolds, arranged in a circle, that contain errors and structural variations. The scatter plot shows the assembly errors identified by misFinder (red circles for misjoins, orange circles for indel errors) and the correct assemblies corresponding to structural variations (blue circles for correct indels, green circles for false misjoins). There are 27 assembly errors and 8 correct assemblies corresponding to artificial SVs. The disagreement plot marks the disagreement for each base in the scaffolds. The zero coverage plot marks each nucleotide with zero coverage. The multi-align ratio plot shows the ratio of multiply aligned reads for each 500-bp region, ranging from 0 to 1. The discordant ratio plot shows the ratio of discordant read pairs for each 500-bp region in the scaffolds, ranging from 0 to 1. The last plot shows the read coverage in the scaffolds.
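The per-window ratio tracks described in the caption (multi-align ratio and discordant ratio over 500-bp regions) can be illustrated with a short Python sketch. It is one plausible way to compute such tracks, not the code behind the figure. The input layout (0-based read start, boolean flag) and the function name `windowed_ratio` are assumptions.

```python
def windowed_ratio(reads, scaffold_len, window=500):
    """Fraction of flagged reads in each fixed-size window of a scaffold.

    reads: iterable of (start_pos, is_flagged) with 0-based start positions,
    where is_flagged marks, for example, a discordant pair or a multiply
    aligned read (hypothetical input layout).
    Returns one ratio in [0, 1] per window; windows without reads get 0.0.
    """
    n_windows = (scaffold_len + window - 1) // window  # ceiling division
    total = [0] * n_windows
    flagged = [0] * n_windows
    for pos, is_flagged in reads:
        if not 0 <= pos < scaffold_len:
            continue  # ignore reads that start outside the scaffold
        idx = pos // window
        total[idx] += 1
        if is_flagged:
            flagged[idx] += 1
    return [f / t if t else 0.0 for f, t in zip(flagged, total)]
```

Calling the function once with a discordant-pair flag and once with a multi-alignment flag would give the two ratio tracks. Assigning each read to the window containing its start position is a simplification chosen for brevity.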

Mentions: We tested misFinder by first aligning the assembly to the reference using BLASTN (v2.2.25+) [20] with the option ‘-best_hit_overhang 0.1’ to reduce redundant short aligned segments. The remaining redundant short segments were then removed to obtain non-redundant aligned segments, so that most scaffolds each had only one large aligned segment with only a few differences, e.g., insertions/deletions and mismatches. Next, the paired-end reads were aligned to the assembly to help extract the suspicious mis-assembly breakpoint regions before their alignment information was analyzed. Finally, 27 assembly errors, including 3 misjoins, 20 insertion errors and 4 deletion errors, were identified according to multiple features of their paired-end read information (Fig. 5).
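The first two steps above (BLASTN with ‘-best_hit_overhang 0.1’, then removal of the remaining redundant short segments) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the file names, the tabular output format, and the rule "drop any segment fully contained in a longer one" are all hypothetical choices, and the text does not say how redundancy was actually defined.

```python
def blastn_command(query_fasta, db_name, out_path):
    """Build a blastn argument list using the option quoted in the text.

    The paths, the database name, and '-outfmt 6' (tabular output) are
    assumptions; only '-best_hit_overhang 0.1' comes from the source.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db_name,
        "-best_hit_overhang", "0.1",
        "-outfmt", "6",
        "-out", out_path,
    ]


def drop_contained_segments(segments):
    """Keep only aligned segments not fully contained in another segment.

    segments: list of (start, end) scaffold coordinates, inclusive.
    Containment is an assumed stand-in for the paper's redundancy filter.
    """
    kept = []
    max_end = None
    # Sort by start ascending, longer segment first on ties, so any segment
    # that contains the current one has already been visited.
    for start, end in sorted(segments, key=lambda s: (s[0], -s[1])):
        if max_end is not None and end <= max_end:
            continue  # lies inside an earlier, longer segment
        kept.append((start, end))
        max_end = end
    return kept
```

The argument list could be passed to `subprocess.run` on a machine with BLAST+ installed and a nucleotide database built from the reference. The filter then reduces each scaffold's hits towards the "one large aligned segment" situation the text describes.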

