Genotype-based test in mapping cis-regulatory variants from allele-specific expression data.

Lefebvre JF, Vello E, Ge B, Montgomery SB, Dermitzakis ET, Pastinen T, Labuda D - PLoS ONE (2012)

Bottom Line: In this study, we introduce a genotype-based mapping test that does not require haplotype-phase inference to locate regulatory regions. The genotype-based test performed equally well with the experimental AI datasets, either from genome-wide cDNA hybridization arrays or from RNA sequencing. By avoiding the need for haplotype inference, the genotype-based test will suit AI analyses in population samples of unknown haplotype structure and will additionally facilitate the identification of cis-regulatory variants that are located far away from the regulated transcript.


Affiliation: Centre de Recherche du CHU Sainte-Justine, Université de Montréal, Montréal, Québec, Canada.

ABSTRACT
Identifying and understanding the impact of gene regulatory variation is of considerable importance in evolutionary and medical genetics; such variants are thought to be responsible for human-specific adaptation and to have an important role in genetic disease. Regulatory variation in cis is readily detected in individuals showing uneven expression of a transcript from its two allelic copies, an observation referred to as allelic imbalance (AI). Identifying individuals exhibiting AI allows mapping of regulatory DNA regions and offers the potential to identify the underlying causal genetic variant(s). However, existing mapping methods require knowledge of the haplotypes, which makes them sensitive to phasing errors. In this study, we introduce a genotype-based mapping test that does not require haplotype-phase inference to locate regulatory regions. The test relies on partitioning the genotypes of individuals exhibiting AI and of those not exhibiting AI in a 2×3 contingency table. The ability of this test to detect linkage disequilibrium (LD) between a potential regulatory site and a SNP located in this region was examined by analyzing simulated and empirical AI datasets. In simulation experiments, the genotype-based test outperforms the haplotype-based tests as the distance separating the regulatory region from its regulated transcript increases. The genotype-based test performed equally well with the experimental AI datasets, either from genome-wide cDNA hybridization arrays or from RNA sequencing. By avoiding the need for haplotype inference, the genotype-based test will suit AI analyses in population samples of unknown haplotype structure and will additionally facilitate the identification of cis-regulatory variants that are located far away from the regulated transcript.
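The 2×3 partition described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 0/1/2 genotype coding, and the choice of a Pearson chi-square statistic are assumptions (the source does not state which statistic is applied to the table, and with small cell counts an exact test may be preferable).

```python
import math
from collections import Counter


def genotype_contingency_test(ai_genotypes, non_ai_genotypes):
    """Pearson chi-square test on a 2x3 table of genotype counts.

    Rows: individuals with AI, individuals without AI.
    Columns: genotypes at the tested SNP, coded 0, 1, 2 (copies of one allele).
    Returns (table, statistic, p_value). Hypothetical helper for illustration.
    """
    table = []
    for group in (ai_genotypes, non_ai_genotypes):
        counts = Counter(group)
        table.append([counts.get(g, 0) for g in (0, 1, 2)])

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Genotype classes absent from both groups carry no information.
    used = [j for j, total in enumerate(col_totals) if total > 0]
    if min(row_totals) == 0 or len(used) < 2:
        return table, 0.0, 1.0

    stat = 0.0
    for i in range(2):
        for j in used:
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected

    # Closed-form chi-square tail probabilities for 2 and 1 degrees of freedom.
    df = len(used) - 1
    if df == 2:
        p_value = math.exp(-stat / 2)
    else:
        p_value = math.erfc(math.sqrt(stat / 2))
    return table, stat, p_value


# Toy data: AI individuals are all heterozygous at the SNP, non-AI are homozygous.
table, stat, p = genotype_contingency_test([1] * 10, [0] * 6 + [2] * 4)
print(table, round(stat, 3), p)
```

Because only unphased genotype counts enter the table, the result does not depend on how the chromosomes are phased, which is the property the abstract emphasizes.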


Figure 4 (pone-0038667-g004): Looking for a regulatory segment ∼0.1 cM from its regulated transcript. Vertical red lines in the middle correspond to the location of the recombination hotspot, the blue line on the left indicates the location of the R site, and the horizontal black line on the upper right corresponds to the location of the regulated transcript. Results of the binomial test with (A) known haplotypes, (B) haplotypes inferred by PHASE [17] and (C) haplotypes inferred by fastPhase [16]; (D) results of the genotype-based contingency test, unaffected by rephasing; linear regression test using (E) rephased data as in B and (F) rephased data as in C.

To address these issues we examined the effect of chromosomal phasing in a situation where the regulated transcript is located at a certain distance from its R site, separated from it by a recombination hotspot placed in the middle of a 100 kb sequence, as illustrated in Figure 4. We selected simulations assigning a regulatory site with a given r allele frequency at the beginning of the sequence. The R and r chromosomes of each AI individual were flagged with the help of a heterozygous SNP Aa at the other end, within the transcribed portion of the sequence (Figure 4). After rephasing, the A and a alleles were used to define the R and r chromosomes, and the p values of the SNPs surrounding the original R site were assessed again. The presence of a single hotspot (here defining a genetic distance of ∼0.1 cM) between the virtual start site of transcription and the regulatory region was sufficient to cause a dramatic loss of power of the haplotype-based tests. Overall, for simulations at an r frequency of ∼0.35, there is a loss of 98.7% (binomial) and 91.1% (linear regression) of significant SNPs (at the p<0.01 level) after chromosome phasing using fastPhase [16], and of 29.7% and 19.7%, respectively, when using PHASE [17] (compare Figure 4B with C, and E with F). Therefore, from now on we present only the results obtained with the better-performing PHASE software. These results are shown in Figure 5 as plots of the log(1/p) values of all SNPs as a function of their r² coefficients with the R site. The upper panels illustrate how the three tests perform when the phase is exactly known, and how their log(1/p) values relate to the LD coefficient r². After rephasing, there is a substantial decrease of log(1/p) values in the haplotype-based tests but not in the contingency test (see also Figures S1, S2, S3 for the data at other frequencies of r). Furthermore, we evaluated the performance of the tests by comparing the log(1/p) values when the extent of phasing errors was known.
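For reference, the LD coefficient r² between a SNP and the R site can be computed from phased haplotypes with the standard formula r² = D² / (p_A(1 − p_A) p_B(1 − p_B)), where D = p_AB − p_A p_B. The sketch below assumes a hypothetical input format (each haplotype as a pair of 0/1 alleles at the two sites) and is not taken from the source.

```python
def ld_r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic sites.

    `haplotypes` is a list of (allele_at_site_1, allele_at_site_2) pairs,
    with alleles coded 0 or 1. Hypothetical helper for illustration.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n       # allele-1 frequency at site 1
    p_b = sum(b for _, b in haplotypes) / n       # allele-1 frequency at site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


# Perfectly correlated sites versus independent sites.
print(ld_r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))
print(ld_r_squared([(1, 1), (1, 0), (0, 1), (0, 0)]))
```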
After rephasing, we extracted the data sets in which all AI individuals were in phase, i.e. without a phase switch error between the regulatory and the transcribed region. We also separated the simulations in which switch errors were observed in only one individual, in two individuals, and in three or more individuals. In simulation experiments at an r frequency of ∼0.35, the phase was conserved in all AI individuals (n = 23) in 15.5% of simulations; we found a switch error in one individual in 24% of the simulations, in two individuals in 25%, and in three or more individuals in the remaining 35% (Table 2). Figure 6 compares the log(1/p) values obtained in these four data sets before and after rephasing using the contingency test (red dots) and the linear regression test (black dots). A drop in log(1/p) values due to phasing errors is observed only for the linear regression test. While already noticeable in the data sets without switch errors in AI individuals, the effect becomes dramatic when two or more individuals are affected (see also Figures S4, S5, S6). It can thus be expected that, with increasing genetic distance between the regulated and regulatory regions, the haplotype-based tests will become even more vulnerable. In other words, an accumulation of phasing errors may preclude the efficient use of haplotype-based tests in mapping regulatory regions that are located far from their regulated transcripts [3], [5], [18].
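The partition of simulations by the number of AI individuals carrying a switch error can be sketched as below. The data layout is an assumption made for illustration: each individual is a pair of haplotypes, and each haplotype is reduced to its alleles at the two end sites (regulatory allele, transcript SNP allele). Only the net phase between those two sites is compared, not the individual switches along the sequence.

```python
def has_switch_error(true_pair, inferred_pair):
    """True if the inferred phase between the two sites differs from the truth.

    Each pair holds two haplotypes, e.g. (("R", "A"), ("r", "a")).
    The order of the two haplotypes within a pair is irrelevant.
    """
    return sorted(true_pair) != sorted(inferred_pair)


def switch_error_bin(true_pairs, inferred_pairs):
    """Classify one simulation as '0', '1', '2' or '3+' affected individuals."""
    n_errors = sum(
        has_switch_error(t, i) for t, i in zip(true_pairs, inferred_pairs)
    )
    return "3+" if n_errors >= 3 else str(n_errors)


# Three AI individuals; only the second one is rephased incorrectly.
truth = [(("R", "A"), ("r", "a"))] * 3
inferred = [
    (("r", "a"), ("R", "A")),  # same phase, haplotypes listed in reverse order
    (("R", "a"), ("r", "A")),  # phase switched between the two sites
    (("R", "A"), ("r", "a")),  # unchanged
]
print(switch_error_bin(truth, inferred))
```

Applying `switch_error_bin` to every simulation and tabulating the labels would give the four data sets compared in the text.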

