FANSe: an accurate algorithm for quantitative mapping of large scale sequencing reads.

Zhang G, Fedyunin I, Kirchner S, Xiao C, Valleriani A, Ignatova Z - Nucleic Acids Res. (2012)

Bottom Line: The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of the RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. FANSe shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.


Affiliation: Biochemistry, Institute of Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, 14467 Potsdam, Germany. zhanggong@jnu.edu.cn

ABSTRACT
The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of the RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. We developed a new, fast and accurate algorithm for nucleic acid sequence analysis, FANSe, with adjustable mismatch allowance settings and the ability to handle indels, to accurately and quantitatively map millions of reads to small or large reference genomes. It is a seed-based algorithm which uses the whole read information for mapping; high sensitivity and low ambiguity are achieved by using short and non-overlapping reads. Furthermore, FANSe uses a hotspot score to prioritize the processing of the most probable matches and implements a modified Smith-Waterman refinement with a reduced scoring matrix to accelerate the calculation without compromising its sensitivity. The FANSe algorithm stably processes datasets from various sequencing platforms, masked or unmasked and small or large genomes. It shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.
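
The seed-based mapping and hotspot prioritization described in the abstract can be illustrated with a minimal Python sketch. This is not the FANSe implementation: the function names (build_seed_index, hotspot_scores, map_read) are hypothetical, the "hotspot score" is assumed here to be the number of non-overlapping seeds that agree on the same read start position, and a plain mismatch count stands in for the Smith-Waterman refinement.

```python
from collections import Counter, defaultdict


def build_seed_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hotspot_scores(read, index, seed_len):
    """Assumed scoring: count, per candidate read start, the seeds that match exactly there."""
    votes = Counter()
    # Cut the read into non-overlapping seeds (a trailing remainder is ignored).
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if start >= 0:
                votes[start] += 1
    return votes


def count_mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, seed_len, max_mismatches):
    """Check candidates in order of decreasing score; return the first acceptable start."""
    votes = hotspot_scores(read, index, seed_len)
    for start, _score in votes.most_common():
        window = reference[start:start + len(read)]
        if len(window) == len(read) and count_mismatches(read, window) <= max_mismatches:
            return start
    return None


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTACCGATGGATCCA"
    read = "T" + reference[6:23]  # 18-nt read with one substitution at its first base
    index = build_seed_index(reference, 6)
    print(map_read(read, reference, index, seed_len=6, max_mismatches=1))
```

In this toy example the first seed is destroyed by the substitution, but the two remaining seeds vote for the same start position, so that candidate is verified first against the whole read.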


Figure 2. Sensitivity and speed of FANSe compared with other mapping algorithms. Mapped reads (left panels) and running time (right panels) for the mapping of E. coli mRNA random fragments to the reference genome with deactivated (A) or activated (B) indel detection using 8-nt seeds. One, two or three mismatches were allowed when indel detection was switched off. (C) Comparison of the read hits for the mRNA random fragments of each E. coli gene mapped by FANSe and BWA (left panel) or BLAT (right panel). (D) Mapped reads (left panel) and running time (right panel) by mapping the HeLa mRNA random fragments to the masked human chromosome 21 allowing three mismatches. Note that some algorithms were only run in indel-enabled (mrsFAST, BLAT) or indel-disabled mode (Bowtie). BLAT mapped 1459 reads within 1 min and is not included in the plots as it is out of scale compared with the other algorithms. GM, Genomemapper.
© Copyright Policy - creative-commons

Mentions: Next, we compared the ability of FANSe to map short reads generated from RNA-seq with that of other algorithms. We extracted the total mRNA from exponentially growing E. coli or eukaryotic HeLa cells, randomly fragmented it into short fragments and sequenced them on the Illumina GAIIx platform. Compared with all of the tested programs, FANSe showed the highest sensitivity in read mapping with disabled indel detection (Figure 2A). When indels were considered, FANSe also achieved the best sensitivity among the algorithms that are capable of handling indels (e.g. BLAT, BWA, mrsFAST and SHRiMP) (Figure 2B). Even though the Illumina GAIIx platform operates at a very low indel rate [estimated to be 0.0032% per nucleotide (43)], enabling the indel search with the Smith–Waterman refinement in FANSe increased the number of mapped reads by 6.5%. With a minimum read length of 18 nt in this dataset, FANSe achieved a complete mapping of all reads when 6-nt seeds were used and one or two mismatches were allowed (Table 1). Note that SOAP2 did not work for this dataset because of an internal error, and Genomemapper only mapped a very small fraction of the reads. When using 8-nt seeds, only 0.27% fewer reads were mapped with FANSe than when 6-nt seeds were used; however, the mapping speed was accelerated by >12-fold.
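
The trade-off between 6-nt and 8-nt seeds reported above is consistent with a general pigeonhole argument for non-overlapping seeds: if a read is cut into n seeds and contains at most n - 1 mismatches, at least one seed must match the reference exactly. The short Python sketch below works through this arithmetic for the 18-nt minimum read length. It is an illustration of the general principle, not an explanation given by the authors, and the helper name guaranteed_mismatches is hypothetical.

```python
def guaranteed_mismatches(read_len, seed_len):
    """Largest mismatch count for which at least one non-overlapping
    seed is still guaranteed to match exactly (pigeonhole principle)."""
    n_seeds = read_len // seed_len  # complete non-overlapping seeds in the read
    return max(n_seeds - 1, 0)


if __name__ == "__main__":
    read_len = 18  # minimum read length in the dataset described above
    for seed_len in (6, 8):
        n_seeds = read_len // seed_len
        print(seed_len, n_seeds, guaranteed_mismatches(read_len, seed_len))
    # prints:
    # 6 3 2
    # 8 2 1
```

Under these assumptions, an 18-nt read yields three 6-nt seeds, so any read with up to two mismatches still has one exact seed hit, whereas two 8-nt seeds guarantee a hit only for up to one mismatch. Longer seeds occur less often in the reference, which would reduce the number of candidate positions to check.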

