quantro: a data-driven approach to guide the choice of an appropriate normalization method.

Hicks SC, Irizarry RA - Genome Biol. (2015)

Bottom Line: Normalization is an essential step in the analysis of high-throughput data. Currently, it is up to the subject matter experts to determine if the stated assumptions are appropriate. Here, we propose a data-driven alternative.


Affiliation: Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02115-5450, USA. shicks@jimmy.harvard.edu.

ABSTRACT
Normalization is an essential step in the analysis of high-throughput data. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation. However, these methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Applying global normalization methods has the potential to remove biologically driven variation. Currently, it is up to the subject matter experts to determine if the stated assumptions are appropriate. Here, we propose a data-driven alternative. We demonstrate the utility of our method (quantro) through examples and simulations. A software implementation is available from http://www.bioconductor.org/packages/release/bioc/html/quantro.html .
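For readers unfamiliar with the quantile normalization mentioned in the abstract, the following is a minimal sketch of the standard procedure, written in plain Python. It is not code from the quantro package; the function name `quantile_normalize` and the list-of-lists input layout (one list of feature values per sample) are assumptions made for illustration. Ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(samples):
    """Force every sample to share the same empirical distribution.

    samples: list of equal-length lists, one list of feature values per sample
    (hypothetical layout chosen for this sketch).
    """
    n_features = len(samples[0])
    if any(len(s) != n_features for s in samples):
        raise ValueError("all samples must have the same number of features")

    # For each sample, the feature indices ordered from smallest to largest value.
    orders = [sorted(range(n_features), key=s.__getitem__) for s in samples]

    # Reference distribution: the mean across samples of the value at each rank.
    reference = [
        sum(s[order[rank]] for s, order in zip(samples, orders)) / len(samples)
        for rank in range(n_features)
    ]

    # Replace each value by the reference value of the same rank.
    normalized = []
    for order in orders:
        out = [0.0] * n_features
        for rank, feature in enumerate(order):
            out[feature] = reference[rank]
        normalized.append(out)
    return normalized
```

After this step every sample has exactly the same set of values, so any global difference between groups of samples, whether technical or biological, is removed. That is the behaviour the abstract warns about.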


Fig. 2: When to use quantile normalization? Examples of gene expression data with targeted changes and global changes in distributions across groups. (a) Transformed read counts from n = 65 RNA-Seq samples from the Yoruba (YRI) population and colored by genotype based on the eQTL rs7639979: GG (blue), GA (green) and AA (red). As no global differences in distributions were detected, this suggests quantile normalization is appropriate, but not necessary as there is a low level of variation within and between groups. (b) Raw perfect match (PM) values from n = 45 arrays comparing the gene expression of alveolar macrophages from nonsmokers (green), smokers (red) and patients with asthma (blue). No global differences in distributions were detected, which indicates quantile normalization is appropriate, as it will remove any platform-based technical variability or batch effects within groups. (c) Raw PM values from n = 82 arrays comparing brain and liver tissue samples. The samples are colored by tissue (brain [red] and liver [green]), and the shades represent different Gene Expression Omnibus IDs. The global differences in distributions detected across brain and liver tissues indicate quantile normalization is not appropriate. Global changes caused by technical variation (e.g., batch effects across groups) will also be detected by quantro, but raw data alone cannot detect this difference.

Mentions: Previously, the burden of deciding if these assumptions hold has been left to the experimentalist. Graphical assessments such as boxplots and density plots can be helpful, but they do not provide a quantitative measure of the variability. Here we propose a statistical test, referred to as quantro, for the assumptions of global adjustment methods, such as quantile normalization, that tests for global differences in distributions between groups of samples. Our test uses the raw unprocessed high-throughput data as input to calculate a test statistic comparing the variability of distributions within groups relative to between groups. If the variability between groups is sufficiently larger than the variability within groups, then this suggests there may be global differences in distributions between groups of samples and global adjustment methods may not be appropriate. We demonstrate the advantages of our method by applying it to several gene expression and DNA methylation datasets with targeted and global changes in distributions (Fig. 2). We define global changes as an abundance of differences between two or more sets of samples affecting the shape or the location shift of the distributions across groups, caused by a biological or a technical source of variation, and targeted changes as differences between sets of samples not affecting the shape or location shift of the distributions, caused by a biological or a technical source of variation. To study the specific downstream improvements afforded by quantro, we studied the specificity and sensitivity of differential expression estimates with a Monte Carlo simulation. Specifically, we studied how global normalization methods can lead to increased bias in downstream analyses, such as detecting differential methylation, when there are global differences in the distributions, and how not applying appropriate normalization methods can lead to increased variance. We demonstrate how, by guiding the choice of a normalization technique, our method provides an overall improvement in sensitivity and specificity.
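The idea of comparing between-group with within-group variability of distributions can be illustrated with a small sketch. This is not the quantro test statistic or its Bioconductor implementation; it is an ANOVA-style ratio computed on sorted sample values, with a label-permutation p-value added as an assumption of this sketch. The names `variability_ratio` and `permutation_p_value`, the default `n_perm`, and the requirement that all samples have the same number of features are all hypothetical choices.

```python
import random


def variability_ratio(samples, groups):
    """Between-group over within-group variability of sample distributions.

    samples: list of equal-length lists of raw values, one list per sample.
    groups:  list of group labels, one per sample.
    Illustrative only; not the statistic used by the quantro package.
    """
    labels = sorted(set(groups))
    if len(labels) < 2 or len(samples) <= len(labels):
        raise ValueError("need at least two groups and more samples than groups")

    # Represent each sample's distribution by its sorted values (empirical quantiles).
    quantiles = [sorted(s) for s in samples]

    def mean_vector(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand = mean_vector(quantiles)
    between = 0.0
    within = 0.0
    for label in labels:
        members = [q for q, g in zip(quantiles, groups) if g == label]
        centre = mean_vector(members)
        between += len(members) * sq_dist(centre, grand)
        within += sum(sq_dist(q, centre) for q in members)

    between /= len(labels) - 1
    within /= len(samples) - len(labels)
    return float("inf") if within == 0 else between / within


def permutation_p_value(samples, groups, n_perm=200, seed=0):
    """Assumed significance assessment: shuffle group labels and recompute."""
    rng = random.Random(seed)
    observed = variability_ratio(samples, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if variability_ratio(samples, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

A large ratio with a small p-value would suggest global differences between groups, the situation in which the text cautions against global adjustment methods. As the figure caption notes, such a result does not by itself say whether the difference is biological or technical.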

