Limits...
HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs.

Russ DE, Ho KY, Longo NS - BMC Bioinformatics (2015)

Bottom Line: HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively.The D-segment is much harder to fit even at 20% mutation probability.For all segments, the probability of correctly matching V, D, and J increases with our alignment score.

View Article: PubMed Central - PubMed

Affiliation: Division of Computational Bioscience, Center for Information Technology, NIH, 12 South Drive, Bethesda, MD, 20892, USA. druss@mail.nih.gov.

ABSTRACT

Background: Partitioning the human immunoglobulin variable region into variable (V), diversity (D), and joining (J) segments is a common sequence analysis step. We introduce a novel approximate dynamic programming method that uses conserved immunoglobulin gene motifs to improve performance of aligning V-segments of rearranged immunoglobulin (Ig) genes. Our new algorithm enhances the former JOINSOLVER algorithm by processing sequences with insertions and/or deletions (indels) and improves the efficiency for large datasets provided by high throughput sequencing.

Results: In our simulations, which include rearrangements with indels, the V-matching success rate improved from 61% for partial alignments of sequences with indels in the original algorithm to over 99% in the approximate algorithm. An improvement in the alignment of human VDJ rearrangements over the initial JOINSOLVER algorithm was also seen when compared to the Stanford.S22 human Ig dataset with an online VDJ partitioning software evaluation tool.

Conclusions: HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively. The D-segment is much harder to fit even at 20% mutation probability. For all segments, the probability of correctly matching V, D, and J increases with our alignment score.

Show MeSH

Related in: MedlinePlus

Matrix calculation of the alignment score for a sequence with a mutation or indel. Matrix calculation of the alignment score for a sequence with a mutation or indel. (a) Matrix for a single nucleotide mismatch. (b) Matrix with a two-base insertion (CG > CAAG). (c) Matrix with a two-base deletion (TC > −−). The dynamic programming matrix for the approximate backwards algorithm begins at the initial T of the VH-motif (last row and column, score = 0). The algorithm goes backwards along the diagonal until it hits a mismatch, in which the algorithm backs up a step and generates a submatrix (solid lines). The algorithm can choose to step up (deletion), step to the left (insertion), or continue diagonally (match/mismatch). For a deletion or insertion, the score initially decrease by 10, but subsequent indels have a score decrease of 4. Matches increase the score by 5, and mismatches decrease the score by 4. The maximum score in the first column or row (bold box) is selected (circled). The algorithm continues stepping backwards on the diagonal. Backtraces are shown as arrows, and label the alignment of the sequences.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4492005&req=5

Fig2: Matrix calculation of the alignment score for a sequence with a mutation or indel. Matrix calculation of the alignment score for a sequence with a mutation or indel. (a) Matrix for a single nucleotide mismatch. (b) Matrix with a two-base insertion (CG > CAAG). (c) Matrix with a two-base deletion (TC > −−). The dynamic programming matrix for the approximate backwards algorithm begins at the initial T of the VH-motif (last row and column, score = 0). The algorithm goes backwards along the diagonal until it hits a mismatch, in which the algorithm backs up a step and generates a submatrix (solid lines). The algorithm can choose to step up (deletion), step to the left (insertion), or continue diagonally (match/mismatch). For a deletion or insertion, the score initially decrease by 10, but subsequent indels have a score decrease of 4. Matches increase the score by 5, and mismatches decrease the score by 4. The maximum score in the first column or row (bold box) is selected (circled). The algorithm continues stepping backwards on the diagonal. Backtraces are shown as arrows, and label the alignment of the sequences.

Mentions: Figure 2 demonstrates how the algorithm aligns sequences. In the examples provided in Figure 2, the starting point uses the most 5’ T from the conserved V-motif, TAT TAC TGT. The germline and unknown sequences match for the next 3 bases (G, T, and G), so the algorithm walks along the diagonal in the 5’ direction of the DP matrix. When a mismatch occurs, it traces back and calculates the score over a small rectangular submatrix. In Figure 2a, which corresponds to a single C to A mutation, the high score traces straight along the diagonal. In Figure 2b, which corresponds to a two base insert (AA), the traceback has two steps down. Figure 2c, which has a two base deletion (TC), the traceback has two right steps. In all the examples in the figure, arrows represent the trace backs, the top rows and first columns are shown in bold, and the highest scores are circled. The algorithm continues stepping back in the 5’ direction along the diagonal from the high scoring element. Finally, the algorithm terminates at the top left matrix element. To conserve space in the figure, the algorithm traced back 1 step and a 5 × 5 square matrix was calculated. However, these parameters are too small to avoid falling into local maxima, thus, in actuality, the algorithm uses a 40 × 40 square matrix on a mismatch, and traces back 10 steps.Figure 2


HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs.

Russ DE, Ho KY, Longo NS - BMC Bioinformatics (2015)

Matrix calculation of the alignment score for a sequence with a mutation or indel. Matrix calculation of the alignment score for a sequence with a mutation or indel. (a) Matrix for a single nucleotide mismatch. (b) Matrix with a two-base insertion (CG > CAAG). (c) Matrix with a two-base deletion (TC > −−). The dynamic programming matrix for the approximate backwards algorithm begins at the initial T of the VH-motif (last row and column, score = 0). The algorithm goes backwards along the diagonal until it hits a mismatch, in which the algorithm backs up a step and generates a submatrix (solid lines). The algorithm can choose to step up (deletion), step to the left (insertion), or continue diagonally (match/mismatch). For a deletion or insertion, the score initially decrease by 10, but subsequent indels have a score decrease of 4. Matches increase the score by 5, and mismatches decrease the score by 4. The maximum score in the first column or row (bold box) is selected (circled). The algorithm continues stepping backwards on the diagonal. Backtraces are shown as arrows, and label the alignment of the sequences.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4492005&req=5

Fig2: Matrix calculation of the alignment score for a sequence with a mutation or indel. Matrix calculation of the alignment score for a sequence with a mutation or indel. (a) Matrix for a single nucleotide mismatch. (b) Matrix with a two-base insertion (CG > CAAG). (c) Matrix with a two-base deletion (TC > −−). The dynamic programming matrix for the approximate backwards algorithm begins at the initial T of the VH-motif (last row and column, score = 0). The algorithm goes backwards along the diagonal until it hits a mismatch, in which the algorithm backs up a step and generates a submatrix (solid lines). The algorithm can choose to step up (deletion), step to the left (insertion), or continue diagonally (match/mismatch). For a deletion or insertion, the score initially decrease by 10, but subsequent indels have a score decrease of 4. Matches increase the score by 5, and mismatches decrease the score by 4. The maximum score in the first column or row (bold box) is selected (circled). The algorithm continues stepping backwards on the diagonal. Backtraces are shown as arrows, and label the alignment of the sequences.
Mentions: Figure 2 demonstrates how the algorithm aligns sequences. In the examples provided in Figure 2, the starting point uses the most 5’ T from the conserved V-motif, TAT TAC TGT. The germline and unknown sequences match for the next 3 bases (G, T, and G), so the algorithm walks along the diagonal in the 5’ direction of the DP matrix. When a mismatch occurs, it traces back and calculates the score over a small rectangular submatrix. In Figure 2a, which corresponds to a single C to A mutation, the high score traces straight along the diagonal. In Figure 2b, which corresponds to a two base insert (AA), the traceback has two steps down. Figure 2c, which has a two base deletion (TC), the traceback has two right steps. In all the examples in the figure, arrows represent the trace backs, the top rows and first columns are shown in bold, and the highest scores are circled. The algorithm continues stepping back in the 5’ direction along the diagonal from the high scoring element. Finally, the algorithm terminates at the top left matrix element. To conserve space in the figure, the algorithm traced back 1 step and a 5 × 5 square matrix was calculated. However, these parameters are too small to avoid falling into local maxima, thus, in actuality, the algorithm uses a 40 × 40 square matrix on a mismatch, and traces back 10 steps.Figure 2

Bottom Line: HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively.The D-segment is much harder to fit even at 20% mutation probability.For all segments, the probability of correctly matching V, D, and J increases with our alignment score.

View Article: PubMed Central - PubMed

Affiliation: Division of Computational Bioscience, Center for Information Technology, NIH, 12 South Drive, Bethesda, MD, 20892, USA. druss@mail.nih.gov.

ABSTRACT

Background: Partitioning the human immunoglobulin variable region into variable (V), diversity (D), and joining (J) segments is a common sequence analysis step. We introduce a novel approximate dynamic programming method that uses conserved immunoglobulin gene motifs to improve performance of aligning V-segments of rearranged immunoglobulin (Ig) genes. Our new algorithm enhances the former JOINSOLVER algorithm by processing sequences with insertions and/or deletions (indels) and improves the efficiency for large datasets provided by high throughput sequencing.

Results: In our simulations, which include rearrangements with indels, the V-matching success rate improved from 61% for partial alignments of sequences with indels in the original algorithm to over 99% in the approximate algorithm. An improvement in the alignment of human VDJ rearrangements over the initial JOINSOLVER algorithm was also seen when compared to the Stanford.S22 human Ig dataset with an online VDJ partitioning software evaluation tool.

Conclusions: HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively. The D-segment is much harder to fit even at 20% mutation probability. For all segments, the probability of correctly matching V, D, and J increases with our alignment score.

Show MeSH
Related in: MedlinePlus