XS: a FASTQ read simulator.

Pratas D, Pinho AJ, Rodrigues JM - BMC Res Notes (2014)

Bottom Line: We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores).

Affiliation: Signal Processing Lab, IEETA/DETI University of Aveiro, Aveiro 3810-193, Portugal. pratas@ua.pt.

ABSTRACT

Background: Emerging next-generation sequencing (NGS) is bringing, besides naturally huge amounts of data, an avalanche of new specialized tools (for analysis, compression and alignment, among others) and large public and private network infrastructures. There is therefore a growing need for specific simulation tools for testing and benchmarking, such as a flexible and portable FASTQ read simulator that needs no reference sequence yet produces data with approximately the same characteristics as real data.

Findings: We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores).

Conclusions: XS provides an efficient and convenient method for fast simulation of FASTQ files, such as those from Ion Torrent (not currently covered by other simulators), Roche-454, Illumina and ABI-SOLiD sequencing machines. This tool is publicly available at http://bioinformatics.ua.pt/software/xs/.

Figure 3: Information profiles of original and simulated DNA sequences. The first plot shows 10 Mbp of a real DNA FASTQ sequence. The remaining plots, also of 10 Mbp each, show simulated sequences with, respectively, 0, 10 and 350 repeats, and 350 reverse-complemented repeats (repeat minimum size: 1, maximum size: 3000, mutation rate: 0.1). All information profiles were computed with an encoder based on multiple Markov models and filtered with a window size of 5, from left to right.
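
As a rough illustration of what an information profile measures, the sketch below estimates the per-base information content (in bits) of a DNA string and smooths it with a window of 5. It is a simplified stand-in, not the encoder used for the figure: it uses a single adaptive order-k Markov model rather than multiple models, and the function name `information_profile`, the `order` and `alpha` parameters and the trailing-mean filter are all assumptions made for this example.

```python
import math
from collections import defaultdict

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def information_profile(seq, order=8, alpha=1.0 / 16, window=5):
    """Return smoothed per-base information content (bits) of `seq`.

    Assumption: one adaptive order-`order` Markov model with additive
    smoothing `alpha`; the figure's encoder mixes several models.
    """
    counts = defaultdict(lambda: [0, 0, 0, 0])  # context -> base counts
    bits = []
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        c = counts[context]
        k = BASE_INDEX[base]
        # Probability of this base given what was seen after this context.
        p = (c[k] + alpha) / (sum(c) + 4 * alpha)
        bits.append(-math.log2(p))
        c[k] += 1  # update the model only after coding the base

    # Trailing mean over the last `window` values (hypothetical filter choice).
    smoothed = []
    running = 0.0
    for i, b in enumerate(bits):
        running += b
        if i >= window:
            running -= bits[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed

if __name__ == "__main__":
    profile = information_profile("ACGT" * 200)
    print(round(profile[0], 3), round(profile[-1], 3))
```

Repeated regions are predicted well by the model and therefore appear as dips in the profile, while sequence with no repetition stays close to 2 bits per base.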

Mentions: Unlike other approaches, XS offers two ways to simulate DNA bases from scratch, without a reference sequence, while maintaining the main characteristics of this type of sequence; this improves the portability of the simulator. The first sets both the length and the nucleotide composition (as percentages), giving a fast simulation with low memory usage. The second extends the first by adding repeats modelled on parts of the DNA seen earlier. These repeats, exemplified in Figure 2, can be exact or approximate, depending on the average mutation level selected, and their sizes range between a custom minimum and maximum. Reverse complement repeats can also be used, and these too can be exact or approximate copies. An example is shown in Figure 3, which uses information profiles [35,36]: zones of low complexity correspond to similarity, that is, to exact or approximate repetitions of other parts of the sequence, as in the first plot (real DNA FASTQ sequence). The second plot shows a simulated sequence without repeats, so the zones of low complexity are also absent. The third, fourth and fifth information profiles contain low-information regions because the simulator used 10 and 350 repeats. The encoder was set to detect reverse complement repeats only for the fifth information profile and left unset for the third and fourth, in order to assess this characteristic of the simulator.
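
The two simulation modes described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the XS implementation: the function names, the parameter names and the choice to model mutations as single-base substitutions are all hypothetical. With `n_repeats=0` it draws independent bases from a given composition; with `n_repeats > 0` it overwrites regions with (optionally reverse-complemented and mutated) copies of earlier parts of the sequence.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse the sequence and complement every base."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def mutate(seq, rate, rng):
    """Substitute each base by a different base with probability `rate`."""
    out = []
    for b in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in "ACGT" if x != b]))
        else:
            out.append(b)
    return "".join(out)

def simulate_dna(length, composition, n_repeats=0, min_size=1,
                 max_size=3000, mutation_rate=0.1, reverse=False, seed=0):
    """Simulate a DNA string of `length` bases (illustrative sketch only)."""
    rng = random.Random(seed)
    bases = list(composition)
    weights = [composition[b] for b in bases]

    # Mode 1: independent bases drawn from the nucleotide composition.
    seq = rng.choices(bases, weights=weights, k=length)

    # Mode 2: overwrite regions with copies of earlier parts of the sequence.
    largest = min(max_size, length // 2)
    if n_repeats > 0 and largest < min_size:
        raise ValueError("sequence too short for the requested repeat size")
    for _ in range(n_repeats):
        size = rng.randint(min_size, largest)
        dst = rng.randint(size, length - size)  # where the repeat is written
        src = rng.randint(0, dst - size)        # earlier, non-overlapping source
        copy = "".join(seq[src:src + size])
        if reverse:
            copy = reverse_complement(copy)
        seq[dst:dst + size] = mutate(copy, mutation_rate, rng)
    return "".join(seq)

if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(simulate_dna(60, uniform, n_repeats=2, max_size=20, reverse=True))
```

A mutation rate of 0 yields exact repeats and a higher rate yields approximate ones, which mirrors the exact versus approximate distinction in the text.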

