misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads.

Zhu X, Leung HC, Wang R, Chin FY, Yiu SM, Quan G, Li Y, Zhang R, Jiang Q, Liu B, Dong Y, Zhou G, Wang Y - BMC Bioinformatics (2015)

Bottom Line: We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct them at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Comparing the assembled sequence with the genome reference detects both assembly errors and correct assemblies that correspond to structural variations.


Affiliation: College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China. zhuxiao.hit@gmail.com.

ABSTRACT

Background: Because of the short read length of high-throughput sequencing data, genome assembly introduces errors that may adversely affect downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and identifying inconsistent features along mis-assembled sequences. However, the former approach cannot distinguish assembly errors from real structural variations between the target genome and the reference genome, while the latter approach can produce many false positives (correctly assembled sequence reported as mis-assembled).

Results: We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct them at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Comparing the assembled sequence with the genome reference detects both assembly errors and correct assemblies that correspond to structural variations. The different types of assembly errors are then distinguished from the correct assemblies by analyzing the aligned paired-end reads, using multiple features derived from coverage and the consistency of insert distance to obtain high-confidence error calls.

Conclusions: We tested misFinder on both simulated and real paired-end read data, and it gave accurate error calls with very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed both by 1) identifying more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguishing correct assemblies corresponding to structural variations from mis-assembled sequence. misFinder can be freely downloaded from https://github.com/hitbio/misFinder .


Fig. 1: Workflow of misFinder. misFinder consists of three major steps. (1) Identification of putative mis-assemblies. Scaffolds (contigs) and the genome reference are first aligned with BLASTN; redundant alignments are then removed, and putative mis-assemblies are identified from the remaining non-redundant alignments. (2) Breakpoint computation. Paired-end reads are aligned to the scaffolds to refine the breakpoints. (3) Mis-assembly validation. Putative mis-assemblies are validated according to the alignment information of paired-end reads on the scaffolds.

Mentions: The workflow of misFinder for identifying assembly errors is shown in Fig. 1. misFinder is based on BLASTN [20]. Its inputs are an assembly (contigs/scaffolds), a genome reference, and paired-end reads from Solexa sequencing technology; its aim is to identify mis-assemblies, i.e., assembly errors, and to correct them to increase the accuracy of the assembly. It consists of three major steps: (1) identify the differences between the scaffolds and the reference using BLASTN alignment; (2) compute the breakpoints according to paired-end read alignment information; (3) validate the differences according to paired-end read alignment information, to distinguish assembly errors from correct assemblies corresponding to structural variations. The algorithm of misFinder is described in detail in the following sections.
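
To make step (1) more concrete, the following is a minimal Python sketch of how differences between scaffolds and a reference might be flagged from BLASTN tabular output (`-outfmt 6`). It is not misFinder's implementation: the `Hit` class, the function names, the three discordance rules, and the `max_gap_diff` threshold of 1000 bp are all assumptions chosen for illustration, and the redundant-alignment removal described in the workflow is omitted.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Hit:
    """One BLASTN alignment of a scaffold (query) to the reference (subject)."""
    query: str
    subject: str
    qstart: int
    qend: int
    sstart: int
    send: int

    @property
    def strand(self) -> str:
        # In tabular BLASTN output, sstart > send marks a minus-strand hit.
        return "+" if self.sstart <= self.send else "-"


def parse_outfmt6(lines):
    """Parse BLASTN tabular (-outfmt 6) lines into Hit records.

    Default columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return hits


def putative_misassemblies(hits, max_gap_diff=1000):
    """Return (scaffold, position, reason) for discordant adjacent alignments.

    Hypothetical rules: two neighbouring alignments on a scaffold are
    discordant if they hit different reference sequences, switch strand,
    or imply a reference distance that differs from the scaffold distance
    by more than max_gap_diff bases.
    """
    calls = []
    hits = sorted(hits, key=lambda h: (h.query, h.qstart))
    for scaffold, group in groupby(hits, key=lambda h: h.query):
        group = list(group)
        for a, b in zip(group, group[1:]):
            if a.subject != b.subject:
                reason = "different reference sequence"
            elif a.strand != b.strand:
                reason = "strand switch"
            else:
                q_gap = b.qstart - a.qend
                # On the minus strand, reference coordinates decrease
                # as scaffold coordinates increase.
                s_gap = b.sstart - a.send if a.strand == "+" else a.send - b.sstart
                if abs(s_gap - q_gap) <= max_gap_diff:
                    continue
                reason = "reference distance disagrees with scaffold distance"
            calls.append((scaffold, a.qend, reason))
    return calls
```

Every position returned by such a comparison is only a candidate. A real structural variation between the target genome and the reference would be flagged in exactly the same way, which is why the workflow goes on to refine breakpoints and validate each candidate with paired-end read alignments in steps (2) and (3).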

