Genetic distance as an alternative to physical distance for definition of gene units in association studies.

Rodriguez-Fontenla C, Calaza M, Gonzalez A - BMC Genomics (2014)

Bottom Line: The definition based on genetic distance was more concordant with the results of these studies than the one based on physical distance. In the analysis of 18 top disease-associated loci from the first study, the SRR≥2 genes led to a fully concordant interpretation in 17 loci; the ±50 Kb genes did so in only 6. A gene definition based on genetic distance led to results more concordant with detailed expert analyses than the commonly used definition based on physical distance.


Affiliation: Laboratorio de Investigacion 10 and Rheumatology Unit, Instituto de Investigacion Sanitaria - Hospital Clinico Universitario de Santiago, Santiago de Compostela, Spain. antonio.gonzalez.martinez-pedrayo@sergas.es.

ABSTRACT

Background: Some association studies, such as those implemented in VEGAS, ALIGATOR, i-GSEA4GWAS, GSA-SNP and other software tools, use genes as the unit of analysis. These genes include the coding sequence plus flanking sequences. Polymorphisms in the flanking sequences are of interest because they involve cis-regulatory elements or inform on untyped genetic variants through linkage disequilibrium. Gene extensions have customarily been defined as ±50 Kb. This approach is not fully satisfactory because genetic relationships between neighbouring sequences are a function of genetic distance, for which physical distance is only a poor substitute.

Results: Standardized recombination rates (SRR) from the deCODE recombination map were used as units of genetic distance. We searched for an SRR producing flanking sequences near the ±50 Kb offset that has been common in previous studies. An SRR≥2 was selected because it led to gene extensions with a median length of 45.3 Kb and has the simplicity of an integer value. As expected, the boundaries of the genes defined with the ±50 Kb and the SRR≥2 rules were rarely concordant. The impact of these differences was illustrated with the interpretation of top association signals from two large studies, each including many hits and a detailed analysis based on different criteria. The definition based on genetic distance was more concordant with the results of these studies than the one based on physical distance. In the analysis of 18 top disease-associated loci from the first study, the SRR≥2 genes led to a fully concordant interpretation in 17 loci; the ±50 Kb genes did so in only 6. Interpretation of the 43 putative functional genes of the second study based on the SRR≥2 definition missed only 4 of the genes, whereas the interpretation based on the ±50 Kb definition missed 10 genes.
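
As a rough illustration of the two rules compared above, the following Python sketch contrasts a fixed ±50 Kb extension with an extension that stops at the nearest map bin whose SRR reaches the threshold. The stopping rule, the bin layout, the function names and all coordinates are assumptions made for illustration only; the abstract does not spell out the exact procedure, and the values are not taken from the deCODE map.

```python
def extend_fixed(start, end, flank=50_000):
    """Extend a gene by a fixed physical distance on both sides."""
    return max(0, start - flank), end + flank


def extend_by_srr(start, end, bins, threshold=2.0):
    """Extend a gene up to the nearest flanking bin with SRR >= threshold.

    `bins` is a list of (bin_start, bin_end, srr) tuples (hypothetical layout).
    If no such bin exists on a side, that boundary is left unchanged
    (an arbitrary fallback chosen for this sketch).
    """
    left, right = start, end
    # Walk upstream from the gene start, nearest bin first.
    for b_start, b_end, srr in sorted(bins, reverse=True):
        if b_end <= start and srr >= threshold:
            left = b_end
            break
    # Walk downstream from the gene end, nearest bin first.
    for b_start, b_end, srr in sorted(bins):
        if b_start >= end and srr >= threshold:
            right = b_start
            break
    return left, right


# Toy recombination map in 10 Kb bins (made-up SRR values).
toy_bins = [
    (0, 10_000, 0.3), (10_000, 20_000, 2.5), (20_000, 30_000, 0.4),
    (30_000, 40_000, 0.8), (60_000, 70_000, 0.5), (70_000, 80_000, 1.1),
    (80_000, 90_000, 3.0),
]
gene = (40_000, 60_000)

fixed_def = extend_fixed(*gene)
srr_def = extend_by_srr(*gene, toy_bins)
print("±50 Kb:", fixed_def)
print("SRR≥2: ", srr_def)
```

In this toy case the SRR-based flanks are asymmetric and shorter than 50 Kb because they end where recombination is high, whereas the fixed rule ignores the map altogether.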

Conclusions: A gene definition based on genetic distance led to results more concordant with detailed expert analyses than the commonly used definition based on physical distance. The genome coordinates for each gene are provided so that the new definitions remain simple to use.
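
With per-gene coordinates in hand, relating an association region to genes can be reduced to an interval-overlap lookup. The sketch below is a minimal example under stated assumptions: closed intervals on a single chromosome, and gene names and coordinates that are entirely hypothetical rather than taken from the coordinates provided with the paper.

```python
def genes_in_region(region, gene_coords):
    """Return the sorted names of genes whose interval overlaps the region.

    `region` is (start, end); `gene_coords` maps gene name -> (start, end).
    Intervals are treated as closed and assumed to lie on one chromosome.
    """
    r_start, r_end = region
    return sorted(
        name
        for name, (g_start, g_end) in gene_coords.items()
        if g_start <= r_end and g_end >= r_start
    )


# Hypothetical association region and two hypothetical genes,
# each given under both definitions (made-up coordinates).
association_region = (100_000, 150_000)
fixed_coords = {"GENE_A": (60_000, 170_000), "GENE_B": (120_000, 230_000)}
srr_coords = {"GENE_A": (95_000, 155_000), "GENE_B": (160_000, 200_000)}

fixed_hits = genes_in_region(association_region, fixed_coords)
srr_hits = genes_in_region(association_region, srr_coords)
print("±50 Kb definition:", fixed_hits)
print("SRR≥2 definition: ", srr_hits)
```

Here the fixed-distance coordinates pull a second gene into the region while the SRR-based coordinates do not, which is the kind of difference the comparison with the published studies counts.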

Figure 3: Relevant genes in loci from the WTCCC GWAS depending on gene definitions. Image modified from Figure five of the WTCCC 2007 GWAS paper with permission [19]. Horizontal lines corresponding to the genes overlapping with the region of association have been added in the middle panel of A), B) and C), in red for the ± 50 Kb rule and in green for the SRR ≥ 2 definition. No lines were added for panel D). The association region in each locus is delimited by vertical dotted lines. The upper panel represents the SNPs (black dots, genotyped; grey dots, imputed) as a function of position (X axis) and association (Y axis = -log10(P)). The middle panel shows centimorgans per Mb estimated from Phase II HapMap. The purple line shows the cumulative genetic distance (in cM) from the hit SNP. The lower panel shows known genes in orange; the top track shows plus-strand genes and the middle track shows minus-strand genes in condensed format. Below these tracks is sequence conservation in 17 vertebrates. Information in the middle and lower panels is from the UCSC Genome Browser. Positions are in NCBI build-35 coordinates. Known genes in the hit region according to the WTCCC paper are listed in the upper right part.

Mentions: We have used two large GWAS with multiple associated loci to illustrate differences between the two gene definitions. However, these analyses should not be confused with an attempt to replace detailed analysis of GWAS results. First, we used the interpretation of 18 top association signals from the 2007 WTCCC GWAS [19]. The authors of this study gave lists of relevant genes for each associated locus based on analysis of the associated SNPs and LD around the top signal. These lists include from 0 to 23 genes. The SRR ≥ 2 definition led to lists that were more concordant with the WTCCC GWAS than those obtained with the ± 50 Kb definition (Table 1 and Additional file 1: Table S1 [20]). All the genes selected by the WTCCC authors were also included when applying the two definitions, but in some loci the gene definitions led to the inclusion of some extra genes. Specifically, the SRR ≥ 2 definition included additional genes in one locus, whereas the ± 50 Kb definition included additional genes in 12 of the 18 loci (P = 0.00015 for the comparison of fully concordant loci). In more detail, six loci included one extra gene according to the ± 50 Kb rule (an example is shown in Figure 3A); four loci included two extra genes with the ± 50 Kb definition (two of these loci are shown in Figure 3B and C); and an additional locus included 3 extra genes in the list obtained with the ± 50 Kb definition (Table 1). The remaining locus was the only one in which the three lists were discordant. This locus is particularly difficult because it shows a very low recombination rate and, therefore, a very wide region of association with ill-defined limits (Figure 3D). In addition, it shows a high density of genes, which implies large differences when applying alternative criteria. Overall, there were 107 genes in the 18 association regions according to the detailed analysis done by the WTCCC authors.
The definition based on genetic distances led to fully concordant results except for the difficult locus, where no criterion can be considered certain (Figure 3D). In contrast, the definition based on ± 50 Kb included 26 additional genes (P = 0.00025 for the comparison of the number of extra genes). Nine of these extra genes were from the difficult locus in chromosome 3, but there were 17 extra genes in other loci. This example illustrates the very good concordance between the post-hoc detailed analysis of each locus done by the WTCCC authors and the simple overlap with gene definitions based on genetic distances. It also illustrates the differences between this definition and the one based on a fixed physical distance.
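
The comparison of fully concordant loci (17 of 18 with SRR ≥ 2 versus 6 of 18 with ± 50 Kb) can be laid out as a 2×2 table. The text does not name the test behind the reported P value, so the sketch below assumes a one-sided Fisher's exact test purely for illustration, treating the two sets of 18 loci as independent groups; the function name is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    The choice of this test is an assumption, not the paper's stated method.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / total


# Rows: SRR >= 2 and ±50 Kb; columns: fully concordant loci, other loci.
p_value = fisher_one_sided(17, 1, 6, 12)
print(f"one-sided P = {p_value:.5f}")
```

Under these assumptions the sketch gives a value of about 1.5 × 10^-4, close to the P reported for this comparison, although that agreement does not establish which test was actually used.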

