PileLine: a toolbox to handle genome position information in next-generation sequencing studies.

Glez-Peña D, Gómez-López G, Reboiro-Jato M, Fdez-Riverola F, Pisano DG - BMC Bioinformatics (2011)

Bottom Line: The structure of these flat files is based on representing one line per position that has been covered by at least one aligned read, imposing significant restrictions from a computational performance perspective. The set of tools comprising PileLine are designed to be memory efficient by performing fast seek on-disk operations over sorted GP files. Our novel toolbox has been extensively tested taking into consideration performance issues.

Affiliation: Higher Technical School of Computer Engineering, University of Vigo, Ourense, Spain. dgpena@uvigo.es

ABSTRACT

Background: Genomic position (GP) files currently used in next-generation sequencing (NGS) studies are always difficult to manipulate due to their huge size and the lack of appropriate tools to properly manage them. The structure of these flat files is based on representing one line per position that has been covered by at least one aligned read, imposing significant restrictions from a computational performance perspective.

Results: PileLine implements a flexible command-line toolkit providing specific support to the management, filtering, comparison and annotation of GP files produced by NGS experiments. PileLine tools are coded in Java and run on both UNIX (Linux, Mac OS) and Windows platforms. The set of tools comprising PileLine are designed to be memory efficient by performing fast seek on-disk operations over sorted GP files.

Conclusions: Our novel toolbox has been extensively tested taking into consideration performance issues. It is publicly available at http://sourceforge.net/projects/pilelinetools under the GNU LGPL license. Full documentation including common use cases and guided analysis workflows is available at http://sing.ei.uvigo.es/pileline.

Figure 1: An example of a PileLine workflow working with SAMTools. .pileup files generated by SAMtools are processed by PileLine in order to perform QC tests, annotation and sample comparison (NS: non-synonymous).

Mentions: The primary input data of PileLine are GP files (e.g. .pileup from SAMtools) containing the chromosome name and the coordinate position as the first two columns (see Figure 1). The main design principle of the PileLine toolbox is to avoid loading input data into memory, so core functions operate directly on disk. One of the available command-line tools is fastseek, which performs a direct binary search on sorted GP files without requiring an additional index to be created. This functionality provides direct access to any range of genomic coordinates without loading the whole file into memory. fastseek first locates the first and last lines of each sequence, and then performs a binary search over the lines belonging to the queried sequence to find the first position within the specified range.
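
The idea of seeking into a sorted GP file without an index can be illustrated with a minimal Python sketch. This is not PileLine's Java implementation, and the names seek_range, _line_at_or_after and _key are hypothetical. It assumes a tab-separated file with no header lines, sequence names sorted in plain byte order (so chr10 precedes chr2) and positions ascending within each sequence. For brevity it folds the two steps described above into a single binary search over byte offsets using the composite key (sequence, position).

```python
import os
import tempfile


def _line_at_or_after(f, offset):
    """Return (start, line) for the first line starting at or after offset.

    line is b"" when no such line exists (end of file).
    """
    if offset == 0:
        f.seek(0)
    else:
        # Step back one byte and read to the next newline, so that an
        # offset falling exactly on a line start is not skipped.
        f.seek(offset - 1)
        f.readline()
    start = f.tell()
    return start, f.readline()


def _key(line):
    """Sort key of a GP line: (sequence name as bytes, position as int)."""
    seq, pos = line.split(b"\t", 2)[:2]
    return seq, int(pos)


def seek_range(path, seq, start, end):
    """Yield the lines of sequence seq with position in [start, end].

    Assumption: the file is sorted by (sequence name in byte order,
    numeric position). Only a handful of lines are read during the search.
    """
    target = (seq.encode(), start)
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Invariant: every line starting before lo sorts below target.
        while lo < hi:
            mid = (lo + hi) // 2
            line_start, line = _line_at_or_after(f, mid)
            if line and _key(line) < target:
                lo = line_start + len(line)  # jump past the smaller line
            else:
                hi = mid
        # lo now leads to the first line whose key is >= target.
        _, line = _line_at_or_after(f, lo)
        while line:
            name, pos = _key(line)
            if name != target[0] or pos > end:
                break
            yield line.decode().rstrip("\r\n")
            line = f.readline()


if __name__ == "__main__":
    # Illustrative four-column rows, not a complete .pileup record.
    rows = [("chr1", 10), ("chr1", 20), ("chr1", 30), ("chr2", 5)]
    with tempfile.NamedTemporaryFile("w", suffix=".pileup", delete=False,
                                     newline="\n") as tmp:
        for name, pos in rows:
            tmp.write(f"{name}\t{pos}\tA\t3\n")
    for hit in seek_range(tmp.name, "chr1", 15, 35):
        print(hit)
    os.remove(tmp.name)
```

Each probe reads at most two lines, so the number of disk reads grows with the logarithm of the file size rather than with the number of covered positions. A file whose sequences follow a different order (for example chr1, chr2, ..., chr10) would need the two-step approach described above or a custom key function.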

