Defining window-boundaries for genomic analyses using smoothing spline techniques.

Beissinger TM, Rosa GJ, Kaeppler SM, Gianola D, de Leon N - Genet. Sel. Evol. (2015)

Bottom Line: Interpretation of data grouped in windows versus at individual locations may increase statistical power, simplify computation, reduce sampling noise, and reduce the total number of tests performed. However, use of adjacent marker information can result in over- or under-smoothing, undesirable window boundary specifications, or highly correlated test statistics. We have developed a novel technique to identify window boundaries for subsequent analysis protocols.


Affiliation: Department of Plant Sciences, University of California, Davis, 95616, USA. beissinger@ucdavis.edu.

ABSTRACT

Background: High-density genomic data are often analyzed by combining information over windows of adjacent markers. Interpretation of data grouped in windows versus at individual locations may increase statistical power, simplify computation, reduce sampling noise, and reduce the total number of tests performed. However, use of adjacent marker information can result in over- or under-smoothing, undesirable window boundary specifications, or highly correlated test statistics. We introduce a method for defining windows based on statistically guided breakpoints in the data, as a foundation for the analysis of multiple adjacent data points. This method involves first fitting a cubic smoothing spline to the data and then identifying the inflection points of the fitted spline, which serve as the boundaries of adjacent windows. This technique does not require prior knowledge of linkage disequilibrium, and therefore can be applied to data collected from individual or pooled sequencing experiments. Moreover, in contrast to existing methods, an arbitrary choice of window size is not necessary, since window sizes are determined empirically and allowed to vary along the genome.

Results: Simulations applying this method were performed to identify selection signatures from pooled sequencing FST data, for which allele frequencies were estimated from a pool of individuals. The ratio of true to false positives was twice that obtained with existing techniques. A comparison of the approach to a previous study that involved pooled sequencing FST data from maize suggested that outlying windows were more clearly separated from their neighbors than when using a standard sliding window approach.

Conclusions: We have developed a novel technique to identify window boundaries for subsequent analysis protocols. When applied to selection studies based on FST data, this method provides a high discovery rate and minimizes false positives. The method is implemented in the R package GenWin, which is publicly available from CRAN.


Figure 1: Depiction of the method. The spline-window method is presented step by step using a simulated set of 200 markers across a chromosome region. (A) Raw data (FST) computed from individual markers. (B) A cubic smoothing spline, indicated by the red line, is fitted to the data. (C) Inflection points of the spline are indicated by dashed vertical lines. (D) Inflection points of the spline are used to define window boundaries, and a statistic such as W is computed.

An overview of the smoothing spline method to define windows is provided in Figure 1. In the first step, a smoothing spline, $\hat{f}(x)$, is fitted to the raw, or unsmoothed, data measured for individual SNPs. Next, this fitted spline is used to identify the positions at which the data are split for window-based analysis. Specifically, the inflection points of the fitted spline (positions where $\hat{f}''(x) = 0$) are taken as window boundaries. Because $\hat{f}(x)$ will necessarily be concave-down at every local maximum, every potential peak in the spline, and therefore every predicted peak in the underlying data, is nearly ensured to fall into a single window rather than being split between windows. In addition, large windows are created in regions where $\hat{f}(x)$ is mostly flat, which implies a low amount of signal relative to noise, and small windows are created in regions where $\hat{f}(x)$ is rougher, which indicates higher signal relative to noise.

Once windows have been defined, analyses may proceed as with any methodology that involves genomic windows. However, the non-uniformity of window sizes must be appropriately accounted for in such analyses. Certain statistics naturally account for this variability. For instance, a simple t-test to assess changes in expected heterozygosity between two populations will appropriately handle differences in the number of observations per window. Other situations require a more cautious treatment of the variability in window sizes. FST-based scans for selection, for example, often use outlier-based thresholds to identify potentially interesting regions with high FST values [12]. In this setting, seemingly high values from small windows may be less informative than seemingly intermediate values from large windows, because windows that contain fewer markers are subject to greater sampling error.
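The window-definition step can be sketched in a few lines of Python. This is an illustration only, not the GenWin implementation: it assumes the spline has already been fitted and evaluated on an evenly spaced grid, and it approximates the second derivative with central second differences. The function names `inflection_points` and `assign_windows`, the sine curve standing in for a fitted spline, and the marker positions are all hypothetical.

```python
import math
from bisect import bisect_right


def inflection_points(xs, fitted):
    """Approximate positions where the curvature of a fitted curve changes sign.

    xs must be evenly spaced; fitted holds the curve values at xs.
    Exact zeros that fall on a grid point are ignored in this sketch.
    """
    # Central second differences are proportional to the second derivative
    # at the interior grid points xs[1], ..., xs[-2].
    d2 = [fitted[i - 1] - 2.0 * fitted[i] + fitted[i + 1]
          for i in range(1, len(xs) - 1)]
    points = []
    for j in range(len(d2) - 1):
        a, b = d2[j], d2[j + 1]
        if a * b < 0:
            # d2[j] belongs to xs[j + 1]; interpolate linearly for the crossing.
            x0, x1 = xs[j + 1], xs[j + 2]
            points.append(x0 + (x1 - x0) * a / (a - b))
    return points


def assign_windows(positions, boundaries):
    """Return a window index for each marker position, given sorted boundaries."""
    return [bisect_right(boundaries, p) for p in positions]


xs = [i * 0.1 for i in range(101)]       # grid from 0 to 10
fitted = [math.sin(x) for x in xs]       # stand-in for a fitted spline
boundaries = inflection_points(xs, fitted)

markers = [0.5, 1.5, 3.0, 4.0, 6.0, 7.0, 9.9]
windows = assign_windows(markers, boundaries)
print([round(b, 3) for b in boundaries])
print(windows)
```

In this toy curve the first peak (near x = 1.57) lies between the left edge and the first inflection point, so the markers around it share one window, which is the behavior the concavity argument predicts.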
A reasonable approximation is to consider individual markers as independent and identically distributed observations with some underlying mean value across the window. This assumption requires random variability about the window mean, which is likely to be achieved for reasonably small windows, and is no stronger than the assumptions typically made for window-based analyses. Then, a t-test-like statistic, W, may be generated for each window.
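The defining equation for W is not reproduced in this text. As a sketch only, the code below assumes one t-like form that is consistent with the i.i.d. approximation: the window mean minus the mean over all markers, divided by the standard error of a mean of n markers, i.e. (window mean - overall mean) / sqrt(overall variance / n). That form, the function name `w_statistic`, and the toy values are assumptions made for illustration.

```python
from statistics import mean, variance


def w_statistic(window_values, all_values):
    """t-like statistic for one window (assumed form, see the text above)."""
    n = len(window_values)
    mu = mean(all_values)        # mean over all markers
    s2 = variance(all_values)    # sample variance over all markers
    return (mean(window_values) - mu) / (s2 / n) ** 0.5


# Toy data: a background region, a large window with intermediate values,
# and a small window with high values.
background = [0.0, 0.1] * 10
large = [0.2] * 16
small = [0.3, 0.3]
all_values = background + large + small

w_small = w_statistic(small, all_values)
w_large = w_statistic(large, all_values)
print(round(w_small, 2), round(w_large, 2))
```

Under this assumed form, the large window with the lower mean scores higher than the two-marker window with the higher mean, which mirrors the point that high values from small windows carry more sampling error.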

