FANSe: an accurate algorithm for quantitative mapping of large scale sequencing reads.

Zhang G, Fedyunin I, Kirchner S, Xiao C, Valleriani A, Ignatova Z - Nucleic Acids Res. (2012)

Bottom Line: The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of the RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. FANSe shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.


Affiliation: Biochemistry, Institute of Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, 14467 Potsdam, Germany. zhanggong@jnu.edu.cn

ABSTRACT
The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of the RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. We developed a new, fast and accurate algorithm for nucleic acid sequence analysis, FANSe, with adjustable mismatch allowance settings and the ability to handle indels, to accurately and quantitatively map millions of reads to small or large reference genomes. It is a seed-based algorithm that uses the whole read information for mapping; high sensitivity and low ambiguity are achieved by using short and non-overlapping reads. Furthermore, FANSe uses a hotspot score to prioritize the processing of highly probable matches and implements a modified Smith-Waterman refinement with a reduced scoring matrix to accelerate the calculation without compromising its sensitivity. The FANSe algorithm stably processes datasets from various sequencing platforms, with masked or unmasked and small or large genomes. It shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.
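The seed-and-prioritize idea described in the abstract can be illustrated with a small Python sketch. This is not the FANSe implementation: the function names, the splitting of a read into non-overlapping 6-nt seeds, the definition of the hotspot score as the number of seeds voting for the same candidate start, and the plain mismatch count used for verification are all assumptions made for illustration. Indel handling and the Smith-Waterman refinement are omitted.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every seed_len-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hotspot_scores(read, index, seed_len):
    """Count, for each candidate read start, how many non-overlapping seeds agree.

    Assumption: the "hotspot score" is modeled here as a simple vote count.
    """
    votes = defaultdict(int)
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for ref_pos in index.get(seed, ()):
            start = ref_pos - offset  # implied start of the whole read
            if start >= 0:
                votes[start] += 1
    return votes


def map_read(read, reference, index, seed_len=6, max_mismatches=3):
    """Check candidates in descending hotspot score; return (start, mismatches) or None."""
    votes = hotspot_scores(read, index, seed_len)
    for start in sorted(votes, key=lambda s: (-votes[s], s)):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # candidate runs past the end of the reference
        # Verification uses the whole read, not only the seeds that matched.
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            return start, mismatches
    return None
```

A caller would build the index once with `build_seed_index(reference, 6)` and then call `map_read` for each read. Raising `max_mismatches` plays the role of the adjustable mismatch allowance: more reads pass verification, at the cost of accepting more divergent placements.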


Figure 4. Comparison of the sensitivity between FANSe and Bowtie by mapping of in silico simulated datasets. (A) Sensitivity of mapping indel-free reads from the E. coli genome and masked human chromosome 21 reference sequence. FANSe was run with 6-nt seeds. (B) Sensitivity of mapping reads from the E. coli genome with a 1% substitution rate and an indel rate ranging from 0.5% to 4%. Indel search is enabled. All tests with Bowtie were run with a three-mismatch allowance. MM, mismatches.
© Copyright Policy - creative-commons

Mentions: To avoid bias as a result of choosing the dataset and application, we used simulated, random sequencing datasets and compared the accuracy of FANSe and Bowtie. For both simulated E. coli reads and human chromosome 21 reads, FANSe and Bowtie achieved a comparably high level of sensitivity, ∼100%, when the substitution rate was varied from 0.5% to 1% (Figure 4A). Increasing the substitution rate further, up to 8%, decreased the sensitivity of Bowtie by 30–80%, depending on the read length, whereas the sensitivity of FANSe decreased by only 10% (Figure 4A). In almost all indel-free cases (Figure 4A), the correctness of FANSe ranged from 97.2% to 99.7% (average 98.8%), which is similar to the correctness of Bowtie (98.0–99.7%, average 98.8%). Increasing the mismatch allowance from three to four halved the number of unmapped reads (Figure 4A); however, only a marginal decrease in the correctness, from 98.2–99.6% to 97.2–99.5%, was detected. The high sensitivity and correctness found when mapping in silico-generated datasets confirm the theoretically estimated accuracy (Supplementary Figure S1).
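A benchmark of this kind can be sketched in Python as follows. The definitions used here are assumptions, since this excerpt does not state them: sensitivity is taken as the fraction of simulated reads that were mapped at all, and correctness as the fraction of mapped reads placed at their true origin. The function names and the read simulator are hypothetical and are not the authors' tooling.

```python
import random


def simulate_reads(reference, n_reads, read_len, sub_rate, rng):
    """Draw reads at random positions; substitute each base with probability sub_rate.

    Returns a list of (true_start, read_sequence) pairs. No indels are introduced.
    """
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        read = [
            rng.choice([b for b in bases if b != base])
            if rng.random() < sub_rate
            else base
            for base in reference[start:start + read_len]
        ]
        reads.append((start, "".join(read)))
    return reads


def sensitivity_and_correctness(true_starts, mapped_starts):
    """Compute (sensitivity, correctness) under the assumed definitions.

    mapped_starts holds the reported start for each read, or None if unmapped.
    Correctness is None when no read was mapped.
    """
    total = len(true_starts)
    mapped = [(t, m) for t, m in zip(true_starts, mapped_starts) if m is not None]
    correct = sum(1 for t, m in mapped if t == m)
    sensitivity = len(mapped) / total if total else 0.0
    correctness = correct / len(mapped) if mapped else None
    return sensitivity, correctness
```

Under these definitions, a larger mismatch allowance can raise sensitivity (fewer unmapped reads) while slightly lowering correctness, which is the trade-off the paragraph above reports for three versus four mismatches.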

