HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs.
Bottom Line:
HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively.The D-segment is much harder to fit even at 20% mutation probability.For all segments, the probability of correctly matching V, D, and J increases with our alignment score.
View Article:
PubMed Central - PubMed
Affiliation: Division of Computational Bioscience, Center for Information Technology, NIH, 12 South Drive, Bethesda, MD, 20892, USA. druss@mail.nih.gov.
ABSTRACT
Show MeSH
Background: Partitioning the human immunoglobulin variable region into variable (V), diversity (D), and joining (J) segments is a common sequence analysis step. We introduce a novel approximate dynamic programming method that uses conserved immunoglobulin gene motifs to improve performance of aligning V-segments of rearranged immunoglobulin (Ig) genes. Our new algorithm enhances the former JOINSOLVER algorithm by processing sequences with insertions and/or deletions (indels) and improves the efficiency for large datasets provided by high throughput sequencing. Results: In our simulations, which include rearrangements with indels, the V-matching success rate improved from 61% for partial alignments of sequences with indels in the original algorithm to over 99% in the approximate algorithm. An improvement in the alignment of human VDJ rearrangements over the initial JOINSOLVER algorithm was also seen when compared to the Stanford.S22 human Ig dataset with an online VDJ partitioning software evaluation tool. Conclusions: HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively. The D-segment is much harder to fit even at 20% mutation probability. For all segments, the probability of correctly matching V, D, and J increases with our alignment score. Related in: MedlinePlus |
Related In:
Results -
Collection
License 1 - License 2 getmorefigures.php?uid=PMC4492005&req=5
Mentions: Figure 2 demonstrates how the algorithm aligns sequences. In the examples provided in Figure 2, the starting point uses the most 5’ T from the conserved V-motif, TAT TAC TGT. The germline and unknown sequences match for the next 3 bases (G, T, and G), so the algorithm walks along the diagonal in the 5’ direction of the DP matrix. When a mismatch occurs, it traces back and calculates the score over a small rectangular submatrix. In Figure 2a, which corresponds to a single C to A mutation, the high score traces straight along the diagonal. In Figure 2b, which corresponds to a two base insert (AA), the traceback has two steps down. Figure 2c, which has a two base deletion (TC), the traceback has two right steps. In all the examples in the figure, arrows represent the trace backs, the top rows and first columns are shown in bold, and the highest scores are circled. The algorithm continues stepping back in the 5’ direction along the diagonal from the high scoring element. Finally, the algorithm terminates at the top left matrix element. To conserve space in the figure, the algorithm traced back 1 step and a 5 × 5 square matrix was calculated. However, these parameters are too small to avoid falling into local maxima, thus, in actuality, the algorithm uses a 40 × 40 square matrix on a mismatch, and traces back 10 steps.Figure 2 |
View Article: PubMed Central - PubMed
Affiliation: Division of Computational Bioscience, Center for Information Technology, NIH, 12 South Drive, Bethesda, MD, 20892, USA. druss@mail.nih.gov.
Background: Partitioning the human immunoglobulin variable region into variable (V), diversity (D), and joining (J) segments is a common sequence analysis step. We introduce a novel approximate dynamic programming method that uses conserved immunoglobulin gene motifs to improve performance of aligning V-segments of rearranged immunoglobulin (Ig) genes. Our new algorithm enhances the former JOINSOLVER algorithm by processing sequences with insertions and/or deletions (indels) and improves the efficiency for large datasets provided by high throughput sequencing.
Results: In our simulations, which include rearrangements with indels, the V-matching success rate improved from 61% for partial alignments of sequences with indels in the original algorithm to over 99% in the approximate algorithm. An improvement in the alignment of human VDJ rearrangements over the initial JOINSOLVER algorithm was also seen when compared to the Stanford.S22 human Ig dataset with an online VDJ partitioning software evaluation tool.
Conclusions: HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively. The D-segment is much harder to fit even at 20% mutation probability. For all segments, the probability of correctly matching V, D, and J increases with our alignment score.