FANSe: an accurate algorithm for quantitative mapping of large scale sequencing reads.

Zhang G, Fedyunin I, Kirchner S, Xiao C, Valleriani A, Ignatova Z - Nucleic Acids Res. (2012)

Bottom Line: The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and of errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. FANSe shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.


Affiliation: Biochemistry, Institute of Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, 14467 Potsdam, Germany. zhanggong@jnu.edu.cn

ABSTRACT
The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and of errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. We developed a new, fast and accurate algorithm for nucleic acid sequence analysis, FANSe, with adjustable mismatch allowance settings and the ability to handle indels, to accurately and quantitatively map millions of reads to small or large reference genomes. It is a seed-based algorithm that uses the whole read information for mapping; high sensitivity and low ambiguity are achieved by using short, non-overlapping seeds. Furthermore, FANSe uses a hotspot score to prioritize the processing of the most likely matches and implements a modified Smith–Waterman refinement with a reduced scoring matrix to accelerate the calculation without compromising its sensitivity. The FANSe algorithm robustly processes datasets from various sequencing platforms, for masked or unmasked and for small or large genomes. It shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.


Figure 1. Principle of the FANSe algorithm. Scheme of a read covered by non-overlapping seeds (A) or an additional overlapping seed (B). The dashed lines mark the offset, i.e. the distance between the seed start position and the read start position. (C) Alignment of seeds to a reference genome (black line). Hotspots are represented as gray bars and the number represents the hotspot score. (D) Accelerated Smith–Waterman scoring to detect indels. Only the scoring area near the diagonal (gray shadow) is calculated. The dashed line represents the backtracking path without indel; the brown line depicts the backtracking path with indels; k is the number of allowed errors.
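
The idea behind panel D, filling only the cells near the diagonal when at most k errors are allowed, can be illustrated with a small sketch. This is not the FANSe implementation: it swaps the modified Smith–Waterman scoring for a plain unit-cost edit distance (mismatch, insertion and deletion each cost 1), and the function name `banded_edit_distance` is hypothetical.

```python
def banded_edit_distance(read, ref, k):
    """Return the edit distance between read and ref if it is <= k, else None.

    Illustrative assumption: unit costs instead of a Smith-Waterman score
    matrix. Only cells with |i - j| <= k (the band around the diagonal) are
    computed; every other cell is treated as "more than k errors".
    """
    n, m = len(read), len(ref)
    if abs(n - m) > k:
        return None  # the end cell lies outside the band
    inf = k + 1  # any value above k means "too many errors"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # Rows are allocated at full width for simplicity, but only the
        # band cells are ever evaluated.
        cur = [inf] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or mismatch (diagonal)
                         prev[j] + 1,         # extra base in the read
                         cur[j - 1] + 1,      # extra base in the reference
                         inf)
        prev = cur
    return prev[m] if prev[m] <= k else None
```

Because an alignment with at most k errors can never leave the band of width k around the diagonal, restricting the computation to that band does not lose any alignment within the error allowance; the work per read drops from roughly n·m cells to roughly n·(2k + 1).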

Step 1: A read is split into several non-overlapping seeds. Each seed is n bases long, with a typical seed size of 6–8 nt (Figure 1A). For reads that are not completely covered by the non-overlapping 6- or 8-nt seeds, an extra seed is taken at the end of the read; this seed overlaps with the penultimate one (Figure 1B).
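
A minimal sketch of Step 1, followed by one possible reading of the hotspot score from Figure 1C. The seed splitting follows the description above. The hotspot part is an assumption: it takes the score of a reference position to be the number of seeds whose exact hits, shifted back by their offset, point to that position as the read start. The function names are hypothetical, and the naive `str.find` scan stands in for whatever index the real implementation uses.

```python
from collections import Counter


def split_into_seeds(read, seed_len=8):
    """Return (offset, seed) pairs that together cover the whole read."""
    if len(read) < seed_len:
        raise ValueError("read is shorter than the seed length")
    # Non-overlapping seeds, taken from the start of the read.
    seeds = [(off, read[off:off + seed_len])
             for off in range(0, len(read) - seed_len + 1, seed_len)]
    # If a tail is left uncovered, add one extra seed anchored at the read
    # end; it overlaps the seed before it (Figure 1B).
    if len(read) % seed_len:
        off = len(read) - seed_len
        seeds.append((off, read[off:]))
    return seeds


def hotspot_scores(read, reference, seed_len=8):
    """Count, per candidate read start, how many seeds support it.

    Assumption for illustration: a seed hit at reference position `pos`
    votes for the read starting at `pos - offset`.
    """
    votes = Counter()
    for offset, seed in split_into_seeds(read, seed_len):
        pos = reference.find(seed)
        while pos != -1:
            votes[pos - offset] += 1
            pos = reference.find(seed, pos + 1)
    return votes
```

For a 20-nt read and 8-nt seeds, `split_into_seeds` returns seeds at offsets 0, 8 and 12, the last one overlapping its predecessor by 4 nt. Under the assumption above, a read with a single mismatch still collects votes from its unaffected seeds, so the true location remains the highest-scoring candidate for the refinement step.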

