ANDES: Statistical tools for the ANalyses of DEep Sequencing.

Li K, Venter E, Yooseph S, Stockwell TB, Eckerle LD, Denison MR, Spiro DJ, Methé BA - BMC Res Notes (2010)

Bottom Line: Tools include the root mean square deviation (RMSD) plot, which allows for the visual comparison of multiple samples on a position-by-position basis, and the computation of base conversion frequencies (transition/transversion rates), variation (Shannon entropy), inter-sample clustering and visualization (dendrogram and multidimensional scaling (MDS) plot), threshold-driven consensus sequence generation and polymorphism detection, and the estimation of empirically determined sequencing quality values. As new sequencing technologies evolve, deep sequencing will become increasingly cost-efficient and the inter- and intra-sample comparisons of largely homogeneous sequences will become more common. We have provided a software package and demonstrated its application on various empirically-derived datasets.

Affiliation: The J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA. kli@jcvi.org.

ABSTRACT

Background: The advancements in DNA sequencing technologies have allowed researchers to progress from the analyses of a single organism towards the deep sequencing of a sample of organisms. With sufficient sequencing depth, it is now possible to detect subtle variations between members of the same species, or between mixed species with shared biomarkers, such as the 16S rRNA gene. However, traditional sequencing analyses of samples from largely homogeneous populations are often still based on multiple sequence alignments (MSA), where each sequence is placed along a separate row and similarities between aligned bases can be followed down each column. While this visual format is intuitive for a small set of aligned sequences, the representation quickly becomes cumbersome as sequencing depths cover loci hundreds or thousands of reads deep.

Findings: We have developed ANDES, a software library and a suite of applications, written in Perl and R, for the statistical ANalyses of DEep Sequencing. The fundamental data structure underlying ANDES is the position profile, which contains the nucleotide distributions for each genomic position resultant from a multiple sequence alignment (MSA). Tools include the root mean square deviation (RMSD) plot, which allows for the visual comparison of multiple samples on a position-by-position basis, and the computation of base conversion frequencies (transition/transversion rates), variation (Shannon entropy), inter-sample clustering and visualization (dendrogram and multidimensional scaling (MDS) plot), threshold-driven consensus sequence generation and polymorphism detection, and the estimation of empirically determined sequencing quality values.
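
The abstract names Shannon entropy and RMSD as per-position statistics but does not give their formulas. The following is a minimal Python sketch, not ANDES code (ANDES itself is written in Perl and R). It assumes that one profile position is a mapping from the symbols A, T, G, C and - to counts, that entropy is measured in bits, and that RMSD is the root of the mean squared difference between two samples' normalized frequencies at the same position. The function names are hypothetical.

```python
import math

# Symbols tracked at each position (four nucleotides plus the gap).
SYMBOLS = "ATGC-"


def normalize(counts):
    """Turn raw counts for one position into frequencies that sum to 1."""
    total = sum(counts.get(s, 0.0) for s in SYMBOLS)
    if total == 0:
        # No coverage at this position: return all zeros rather than divide by 0.
        return {s: 0.0 for s in SYMBOLS}
    return {s: counts.get(s, 0.0) / total for s in SYMBOLS}


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution at one position."""
    pmf = normalize(counts)
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def rmsd(counts_a, counts_b):
    """Assumed definition: root mean square deviation between the normalized
    frequencies of two samples at the same position."""
    pa, pb = normalize(counts_a), normalize(counts_b)
    squared = sum((pa[s] - pb[s]) ** 2 for s in SYMBOLS)
    return math.sqrt(squared / len(SYMBOLS))
```

Under these assumptions, a position where every read agrees has entropy 0, and two samples with identical frequencies at a position have an RMSD of 0 there, so peaks in either statistic mark positions worth inspecting.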

Conclusions: As new sequencing technologies evolve, deep sequencing will become increasingly cost-efficient and the inter- and intra-sample comparisons of largely homogeneous sequences will become more common. We have provided a software package and demonstrated its application on various empirically-derived datasets. Investigators may download the software from SourceForge at https://sourceforge.net/projects/andestools.

Figure 2: Example position profile for the MSA shown in Figure 1. Column 1 contains the most common allele. When multiple alleles share the value of the greatest frequency, one of the alleles is arbitrarily chosen to represent that position. Columns 2-5 represent the frequency of each nucleotide at each position for the sample represented by the position profile.

Mentions: The fundamental data structure of the ANDES suite of tools is the position profile, which stores the nucleotide counts that support each position of the analyzed region of interest. ANDES currently supports conversion from the output of clustalw [6] (i.e. .aln files). Because the format of the position profile is simple, output from an alternative alignment application should be readily supported; alignments generated with the RazerS [7], MUSCLE [8], and AMOScmp [9] tools have also been successfully converted into position profiles and analyzed. Figure 1 shows the input MSA from which the position profile in Figure 2 was created. Each column in the position profile is calculated by summing the nucleotide or gap contributions at each position. Once its values are normalized to sum to 1, each line of the position profile is essentially a probability mass function (pmf) with support on {A, T, G, C, - (gap)}. Gaps introduced at the beginning or end of the MSA are assumed to be caused by read termination, so they are not considered nucleotide variations and are ignored. Bases with IUPAC ambiguity codes [10] in the alignment contribute equal weight to each nucleotide they represent. For example, the IUPAC code M represents A or C, so 0.5 is added to both the A and C counts at that position. The major allele is also stored in the profile for visual convenience. The number of lines in the position profile is the gapped length of the consensus sequence for the MSA.
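
The counting rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ANDES implementation: the input is taken to be a list of equal-length aligned read strings rather than a parsed .aln file, "gaps at the beginning or end" is interpreted per read, characters outside the IUPAC table are skipped, and the names `position_profile` and `major_allele` are hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SYMBOLS = "ATGC-"


def position_profile(aligned_reads):
    """Build one count row per alignment column from equal-length reads."""
    length = len(aligned_reads[0])
    profile = [{s: 0.0 for s in SYMBOLS} for _ in range(length)]
    for read in aligned_reads:
        read = read.upper()
        core = read.strip("-")
        if not core:
            continue  # read consists only of gaps
        # Leading and trailing gaps are treated as read termination and ignored.
        start = len(read) - len(read.lstrip("-"))
        end = start + len(core)
        for pos in range(start, end):
            ch = read[pos]
            if ch == "-":
                profile[pos]["-"] += 1.0  # internal gap counts as a variant
                continue
            bases = IUPAC.get(ch)
            if bases is None:
                continue  # assumption: unknown characters are skipped
            for base in bases:
                # Ambiguity codes split one count equally among their bases.
                profile[pos][base] += 1.0 / len(bases)
    return profile


def major_allele(row):
    """Most frequent symbol in a row; ties go to the first symbol in SYMBOLS order."""
    return max(SYMBOLS, key=lambda s: row[s])


if __name__ == "__main__":
    reads = ["ACGT-", "A-GTA", "MCG--"]
    for row in position_profile(reads):
        print(major_allele(row), [row[s] for s in SYMBOLS])
```

Keeping raw counts and normalizing only when a statistic needs a pmf preserves the depth at each position, which would be lost if frequencies were stored directly.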

