h5vc: scalable nucleotide tallies with HDF5.

Pyl PT, Gehring J, Fischer B, Huber W - Bioinformatics (2014)

Bottom Line: Here we present h5vc, a data structure and associated software for managing tallies. Through the simplicity of its API, we envision making low-level analysis of large sets of genome sequencing data accessible to a wider range of researchers. The HDF5 system is used as the core of our implementation. Contact: pyl@embl.de or whuber@embl.de. Supplementary data are available at Bioinformatics online.


Affiliation: EMBL Heidelberg, Genome Biology Unit, Meyerhofstr. 1, 69117 Heidelberg, Germany.

ABSTRACT

Summary: As applications of genome sequencing, including exomes and whole genomes, are expanding, there is a need for analysis tools that are scalable to large sets of samples and/or ultra-deep coverage. Many current tool chains are based on the widely used file formats BAM and VCF or VCF-derivatives. However, for some desirable analyses, data management with these formats creates substantial implementation overhead, and much time is spent parsing files and collating data. We observe that a tally data structure, i.e. the table of counts of nucleotides × samples × strands × genomic positions, provides a reasonable intermediate level of abstraction for many genomics analyses, including single nucleotide variant (SNV) and InDel calling, copy-number estimation and mutation spectrum analysis. Here we present h5vc, a data structure and associated software for managing tallies. The software contains functionality for creating tallies from BAM files, flexible and scalable data visualization, data quality assessment, computing statistics relevant to variant calling and other applications. Through the simplicity of its API, we envision making low-level analysis of large sets of genome sequencing data accessible to a wider range of researchers.

Availability and implementation: The package h5vc for the statistical environment R is available through the Bioconductor project. The HDF5 system is used as the core of our implementation.

Contact: pyl@embl.de or whuber@embl.de

Supplementary information: Supplementary data are available at Bioinformatics online.
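To make the tally described in the Summary concrete, the following is a minimal sketch in plain Python, not the h5vc implementation. It assumes a nested-list table indexed as [base][sample][strand][position], ungapped toy reads, and 0-based positions; the function names (`make_tally`, `add_read`, `coverage`, `mismatch_count`) are hypothetical, and the layout of h5vc's actual HDF5 datasets may differ. Deletions and insertions, which the real tallies also record, are left out.

```python
# Illustrative only: a toy tally of nucleotides x samples x strands x positions.
# All names here are hypothetical and are not part of the h5vc API.
BASES = "ACGT"
STRANDS = "+-"


def make_tally(n_samples, n_positions):
    """Return a zero-filled table indexed as [base][sample][strand][position]."""
    return [[[[0] * n_positions for _ in STRANDS]
             for _ in range(n_samples)]
            for _ in BASES]


def add_read(tally, sample, strand, start, sequence):
    """Count each base of an ungapped read aligned at 0-based `start`."""
    n_positions = len(tally[0][0][0])
    for offset, base in enumerate(sequence):
        pos = start + offset
        if base in BASES and 0 <= pos < n_positions:
            tally[BASES.index(base)][sample][STRANDS.index(strand)][pos] += 1


def coverage(tally, sample, pos):
    """Total depth at one position, summed over bases and strands."""
    return sum(tally[b][sample][s][pos]
               for b in range(len(BASES))
               for s in range(len(STRANDS)))


def mismatch_count(tally, sample, pos, ref_base):
    """Number of counted bases at `pos` that differ from the reference base."""
    ref_index = BASES.index(ref_base)
    matching = sum(tally[ref_index][sample][s][pos]
                   for s in range(len(STRANDS)))
    return coverage(tally, sample, pos) - matching


# Toy data (assumed): sample 0 = control, sample 1 = tumor.
reference = "ACGTA"
tally = make_tally(n_samples=2, n_positions=len(reference))
add_read(tally, 0, "+", 0, "ACGTA")
add_read(tally, 0, "-", 1, "CGTA")
add_read(tally, 1, "+", 0, "ACTTA")  # G>T mismatch at position 2
add_read(tally, 1, "-", 0, "ACTT")   # same mismatch on the reverse strand
add_read(tally, 1, "+", 2, "GTA")

control_mismatches = mismatch_count(tally, 0, 2, reference[2])
tumor_mismatches = mismatch_count(tally, 1, 2, reference[2])
```

Once the table exists, quantities such as per-position depth or mismatch counts reduce to sums over slices of it, with no need to parse the alignment files again. This is the kind of intermediate abstraction the abstract argues for.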


Fig. 1. mismatchPlots of two candidate variant sites. Each sample (Control, Tumor) is shown in a separate panel, with the genomic position as a common x-axis centered around the position of the variant. Along the y-axis, alignment statistics of the forward and reverse strand are shown as positive and negative values, respectively. Gray areas represent coverage by sequences matching the reference, and colored areas represent mismatches, deletions and insertions. (a) Variant is present in the tumor sample but not in the control. (b) Variant with position-specific statistics comparable to those of (a). Note the noisiness of the region, which is not immediately obvious from the position-specific values alone.

Mentions: An important part of variant calling is quality control. After automated procedures have reduced the number of potential variant sites to a manageable scale, rapid visualization of those sites can be instrumental for assessing the performance of the algorithms used. One of the most informative visualizations for a limited set of samples is the mismatchPlot (Fig. 1). It shows the coverage, mismatches and deleted bases for each sample in a genomic region and is generated directly from the tally data.
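The plot geometry described for the mismatchPlot can be sketched as a small calculation on per-strand counts. This is a hedged illustration, not h5vc's plotting code: the function name `mismatch_plot_layers` is hypothetical, the counts are made-up toy values, and stacking the mismatch area on top of the reference-matching area is an assumption about the layout. Forward-strand values are kept positive and reverse-strand values are negated, as the figure caption states.

```python
# Illustrative only: y-extents for one sample panel of a mismatch-style plot.
# The function name and the stacking order are assumptions, not the h5vc API.
def mismatch_plot_layers(match_fwd, match_rev, mis_fwd, mis_rev):
    """Return, per position, the (low, high) y-extent of each drawn area.

    Forward-strand counts point upwards (positive) and reverse-strand
    counts point downwards (negative). Mismatches are stacked beyond
    the reference-matching coverage on their own strand.
    """
    lengths = {len(match_fwd), len(match_rev), len(mis_fwd), len(mis_rev)}
    if len(lengths) != 1:
        raise ValueError("all count vectors must cover the same positions")

    layers = []
    for mf, mr, xf, xr in zip(match_fwd, match_rev, mis_fwd, mis_rev):
        layers.append({
            "match": (-mr, mf),              # gray area
            "mismatch_fwd": (mf, mf + xf),   # colored, above forward matches
            "mismatch_rev": (-mr - xr, -mr), # colored, below reverse matches
        })
    return layers


# Toy tumor panel (assumed counts): a variant at the third position.
tumor_layers = mismatch_plot_layers(
    match_fwd=[3, 3, 1, 3],
    match_rev=[2, 2, 1, 2],
    mis_fwd=[0, 0, 2, 0],
    mis_rev=[0, 0, 1, 0],
)
```

Each dictionary could be handed to any plotting library as the bounds of a filled area. Drawing one such panel per sample on a shared x-axis gives the side-by-side Control and Tumor comparison described in the caption.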

