A flexible Bayesian method for detecting allelic imbalance in RNA-seq data.

León-Novelo LG, McIntyre LM, Fear JM, Graze RM - BMC Genomics (2014)

Bottom Line: The proposed model always has a lower type I error rate compared to the binomial test. Consequently, as variant identification improves, the need for DNA controls will be reduced. Filtering does not significantly improve performance and is not recommended, as information is sacrificed without a measurable gain.


Affiliation: Department of Biological Sciences, Auburn University, 101 Rouse Life Science Building, 36849 Auburn, AL, USA. rmgraze@auburn.edu.

ABSTRACT

Background: One method of identifying cis regulatory differences is to analyze allele-specific expression (ASE) and identify cases of allelic imbalance (AI). RNA-seq is the most common way to measure ASE and a binomial test is often applied to determine statistical significance of AI. This implicitly assumes that there is no bias in estimation of AI. However, bias has been found to result from multiple factors including: genome ambiguity, reference quality, the mapping algorithm, and biases in the sequencing process. Two alternative approaches have been developed to handle bias: adjusting for bias using a statistical model and filtering regions of the genome suspected of harboring bias. Existing statistical models which account for bias rely on information from DNA controls, which can be cost prohibitive for large intraspecific studies. In contrast, data filtering is inexpensive and straightforward, but necessarily involves sacrificing a portion of the data.
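As a rough illustration of why the no-bias assumption matters, the sketch below runs an exact two-sided binomial test on allele-specific read counts twice: once against the usual null proportion of 0.5, and once against a null shifted to an assumed bias value. The function names, the example counts (65 of 100 reads assigned to allele 1), and the bias value of 0.6 are all hypothetical and are not taken from the paper; this is only the binomial test with a shifted null, not the PG model.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed count k under the null proportion p."""
    pmfs = [binom_pmf(j, n, p) for j in range(n + 1)]
    observed = pmfs[k]
    # Small relative tolerance so ties are not lost to floating-point error.
    total = sum(pm for pm in pmfs if pm <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: 65 of 100 reads assigned to allele 1.
allele1_reads, total_reads = 65, 100

# Naive test: assumes no bias, so the null proportion is 0.5.
p_naive = binom_two_sided_p(allele1_reads, total_reads, 0.5)

# Bias-aware test: assumes (hypothetically) that with no AI, 60% of reads
# would still be assigned to allele 1 because of mapping bias.
p_adjusted = binom_two_sided_p(allele1_reads, total_reads, 0.6)

print(f"p-value assuming no bias:  {p_naive:.4f}")
print(f"p-value with null at 0.6:  {p_adjusted:.4f}")
```

With these made-up numbers the naive test rejects at the 0.05 level while the bias-aware test does not, which is the kind of false positive that an unmodeled bias can produce.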

Results: Here we propose a flexible Bayesian model for analysis of AI, which accounts for bias and can be implemented without DNA controls. In lieu of DNA controls, this Poisson-Gamma (PG) model uses an estimate of bias from simulations. The proposed model always has a lower type I error rate compared to the binomial test. Consistent with prior studies, bias dramatically affects the type I error rate. All of the tested models are sensitive to misspecification of bias. The closer the estimate of bias is to the true underlying bias, the lower the type I error rate. Correct estimates of bias result in a level alpha test.
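The sensitivity to misspecified bias can be made concrete with a small exact calculation. The sketch below uses a bias-adjusted binomial test as a stand-in (it is not the PG model, whose details are not given here) and computes its type I error rate when the assumed bias differs from the true bias. The function names, the read depth of 100, and the bias values are assumptions chosen for illustration only.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def type_one_error(n, q_true, q_assumed, alpha=0.05):
    """Exact rejection probability of a two-sided binomial test whose null
    proportion is q_assumed, when there is no AI and reads are actually
    assigned to allele 1 with probability q_true."""
    null_pmf = [binom_pmf(k, n, q_assumed) for k in range(n + 1)]
    rate = 0.0
    for k in range(n + 1):
        # Two-sided p-value of count k under the assumed null.
        p_value = sum(pm for pm in null_pmf if pm <= null_pmf[k] * (1 + 1e-9))
        if p_value <= alpha:
            # Count k is rejected; add its probability under the true bias.
            rate += binom_pmf(k, n, q_true)
    return rate


# Hypothetical setting: 100 reads, true bias 0.55 toward allele 1, no AI.
n_reads, q_true = 100, 0.55
for q_assumed in (0.50, 0.55):
    rate = type_one_error(n_reads, q_true, q_assumed)
    print(f"assumed bias {q_assumed:.2f}: type I error rate {rate:.3f}")
```

In this toy setting, assuming no bias (0.50) rejects a true null far more often than the nominal 0.05, while using the correct bias keeps the rejection rate at or below alpha, in line with the statement that correct estimates of bias give a level alpha test.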

Conclusions: To improve the assessment of AI, some forms of systematic error (e.g., map bias) can be identified using simulation. The resulting estimates of bias can be used to correct for bias in the PG model, without data filtering. Other sources of bias (e.g., unidentified variant calls) can be easily captured by DNA controls, but are missed by common filtering approaches. Consequently, as variant identification improves, the need for DNA controls will be reduced. Filtering does not significantly improve performance and is not recommended, as information is sacrificed without a measurable gain. The PG model developed here performs well when bias is known, or slightly misspecified. The model is flexible and can accommodate differences in experimental design and bias estimation.


Figure 1: Sources of error in read alignments and allele-specific read counts contribute to bias in estimation of ASE and AI. Here we consider error originating from sequence similarity in the genome (e.g., repeats and duplications) and hidden variation (missed or false SNPs). The examples shown illustrate cases for alignments to a single reference (A-C) and to multiple references (D-E). Alignments to augmented references are expected to behave similarly to alignments to multiple references. A) Masking SNPs located in regions with strong sequence similarity to other locations in the genome (genome sequence ambiguity) can result in alignment error: the best match in the masked reference may be at a location other than the true source of the read. B) Algorithms that account for multiple mapping can result in allele bias when reads from one of the alleles are discarded or are mapped randomly, while reads from the other allele map to their true source location. C) For a single unmasked reference, reads from one of the alleles may not align at all, resulting in bias toward the other allele. D) When two references are used (one for each parental genome), differences between the references in genome sequence ambiguity can result in allele bias for the same reason as outlined in B. E) Sequencing errors in one reference can result in allele bias when reads from both (identical) alleles align best to the other reference.

Mentions: These Bayesian methods have primarily focused on proper handling of error variance in the statistical model. However, bias in estimation of AI is an important issue for both intraspecific [35, 49] and interspecific [38, 48, 50] studies. Biases are present when aligning to a single reference, to a single reference with SNPs masked, and to multiple references, and these biases can result in false positives for AI [35, 49–51]. Bias in estimation of allele-specific expression or allelic imbalance has multiple sources, including sequence differences between reads and reference (missed SNPs/false SNPs), properties of alignment algorithms, genome features that result in ambiguity of read alignments, and other technical sources of error (Figure 1) [35, 52, 53].

