RIG: Recalibration and interrelation of genomic sequence data with the GATK.
Bottom Line: Comparison with a recent sorghum resequencing study shows that the workflow identifies an additional 1.62 million high-confidence variants from the same sequence data.Finally, the workflow's performance is validated using Arabidopsis sequence data, yielding variant call sets with 95% sensitivity and 99% positive predictive value.The Recalibration and Interrelation of genomic sequence data with the GATK (RIG) workflow enables the GATK to accurately identify genetic variation in organisms lacking validated variant resources.
Affiliation: Interdisciplinary Program in Genetics, Texas A&M University, College Station, Texas 77843 Biochemistry & Biophysics Department, Texas A&M University, College Station, Texas 77843.Show MeSH
Mentions: In accordance with the RIG workflow, we used reduced representation data of an experimental cross and association panels to enable both BQSR and VQSR of WGS data of 49 resequenced individuals for the crop plant Sorghum bicolor. By interrelating data sources produced by different template generation methods with the RIG workflow, we enforced that the variants used to train the VQSR Gaussian mixture models that determine a variant’s VQSLOD score (logarithm of odds ratio that a variant is real vs. not real under the trained Gaussian mixture model) were found orthogonally, providing additional confidence that the variants used for training were real variants. Additionally, the differences in reliability of the training variants due to the different experimental designs were also considered for training of the VQSR models; variants from the experimental cross were assigned a higher prior likelihood of being correct than those from the association panels. By following the RIG workflow, each SNP and indel in the raw WGS variants was assigned a VQSLOD score that reflects its reliability. Figure 5 depicts the process of interrelating data for VQSR and the resulting VQSLOD scores of variant calls. While interrelating data from different template generation methods may be optimal, we also obtained good performance by following similar processing logic using only Arabidopsis WGS data. In this way, the RIG workflow enables one of the greatest strengths of the GATK: the ability to put variant calls in a probabilistic framework that allows users to define where on the sensitivity and specificity spectrum the variants should sit for their target downstream application.
Affiliation: Interdisciplinary Program in Genetics, Texas A&M University, College Station, Texas 77843 Biochemistry & Biophysics Department, Texas A&M University, College Station, Texas 77843.