A Hybrid One-Way ANOVA Approach for the Robust and Efficient Estimation of Differential Gene Expression with Multiple Patterns.

Mollah MM, Jamal R, Mokhtar NM, Harun R, Mollah MN - PLoS ONE (2015)

Bottom Line: The β-weight function assigns smaller weights (≥ 0) to outlying expressions and larger weights (≤ 1) to typical expressions. The distribution of the β-weights is used to calculate the cut-off point, which is compared to the observed β-weight of an expression to determine whether that gene expression is an outlier. This weight function plays a key role in unifying the robustness and efficiency of estimation in one-way ANOVA.


Affiliation: Institut Perubatan Molekul UKM (UMBI), Universiti Kebangsaan Malaysia (UKM), Jalan Ya'acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia.

ABSTRACT

Background: Identifying genes that are differentially expressed (DE) between two or more conditions with multiple patterns of expression is one of the primary objectives of gene expression data analysis. Several statistical approaches, including one-way analysis of variance (ANOVA), are used to identify DE genes. However, most of these methods provide misleading results for two or more conditions with multiple patterns of expression in the presence of outlying genes. In this paper, an attempt is made to develop a hybrid one-way ANOVA approach that unifies the robustness and efficiency of estimation using the minimum β-divergence method to overcome some problems that arise in the existing robust methods for both small- and large-sample cases with multiple patterns of expression.

Results: The proposed method relies on a β-weight function, which produces values between 0 and 1. The β-weight function with β = 0.2 is used as a measure of outlier detection. It assigns smaller weights (≥ 0) to outlying expressions and larger weights (≤ 1) to typical expressions. The distribution of the β-weights is used to calculate the cut-off point, which is compared to the observed β-weight of an expression to determine whether that gene expression is an outlier. This weight function plays a key role in unifying the robustness and efficiency of estimation in one-way ANOVA.

Conclusion: Analyses of simulated gene expression profiles revealed that all eight methods (ANOVA, SAM, LIMMA, EBarrays, eLNN, KW, robust BetaEB and proposed) perform almost identically for m = 2 conditions in the absence of outliers. However, the robust BetaEB method and the proposed method exhibited considerably better performance than the other six methods in the presence of outliers. In this case, the BetaEB method exhibited slightly better performance than the proposed method for the small-sample cases, but the proposed method exhibited much better performance than the BetaEB method for both the small- and large-sample cases in the presence of more than 50% outlying genes. The proposed method also exhibited better performance than the other methods for m > 2 conditions with multiple patterns of expression, a setting to which the BetaEB method has not been extended. Therefore, the proposed approach would be more suitable and reliable on average for the identification of DE genes between two or more conditions with multiple patterns of expression.


Fig 3 (pone.0138810.g003). Venn diagram and outlier gene expression profile for pancreatic cancer data. (a) Venn diagram of the DE genes estimated by all four methods (ANOVA, LIMMA, KW and Proposed) based on pairwise comparisons of CTC vs T, CTC vs P, CTC vs G, T vs P, T vs G and G vs P. (b) Frequency distributions of β-weights for each expression of the 8152 genes in 24 samples. (c) Scatter plot of the smallest β-weight for each of the 8152 genes vs. the gene index, where the smallest value represents the minimum value of 24 β-weights from 24 samples for each gene. The red circles between the two gray lines represent moderate/noisy outliers, whereas the other red circles, corresponding to β-weights of less than 0.2, represent extreme outliers. (d) Plot of ordered smallest β-weights in (c) for 8152 genes. (e) The 80 DE genes detected by the proposed method only, as shown in (a). Seventeen out of 80 DE genes were detected as extreme outlying genes using the β-weight function. The results for the T, P, G and CTC groups are plotted above the lines with four different colors. The outlying samples are indicated by circles above them.

This dataset was used in the research reported in [29]. It consists of 24 samples, with 6 replicates in each of m = 4 conditions: circulation tumor samples (CTC), hematological cells (G), original tumor tissue (T), and non-tumor pancreatic control tissue (P) from patients. It represents 8152 genes. We downloaded this dataset from the website http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=zbotduky//wssgujm/&acc=GSE18670. Because this dataset had already been pre-processed by previous users [29], we directly applied four methods (ANOVA, LIMMA, KW and Proposed) to it to test, for each gene, the null hypothesis (H0) of no differential expression among the four groups. If H0 was rejected, we performed pairwise comparison tests for all four methods to identify the pattern of gene expression. Fig 3a presents a Venn diagram of the DE genes estimated by the four methods based on pairwise comparisons of CTC vs T, CTC vs P, CTC vs G, T vs P, T vs G and G vs P, which was also the approach used in [29]. We used the Bonferroni method to adjust the p-values for ANOVA and KW and the Benjamini-Hochberg (BH) method to adjust the p-values for LIMMA and the proposed method. To identify DE genes, we used adjusted p-values at the 5% level of significance together with an absolute fold change of 2 for each pairwise comparison. The Venn diagram shows that 2712 DE genes were common to all four methods in these 6 pairwise comparisons. The β-weight function of the proposed method identified 31% of the genes in the entire dataset as outliers. The outlier genes are indicated in red in Fig 3(b-d). However, some outliers may not influence the classical methods. In our simulation studies, we considered some special types of outliers that do influence the classical methods.
For example, let us consider a true EE gene between two groups as EE = {(20, 21, 22), (21, 20, 22)} and a true DE gene between two groups as DE = {(20, 21, 22), (26, 27, 28)}, where each pair of parentheses encloses the sample expressions from one group. For the EE gene, the absolute mean difference between the two groups is 0, whereas for the DE gene it is 6. Suppose we corrupt the EE gene with outlying expressions as EE1 = {(20, 59*, 22), (21, 60*, 22)}, EE2 = {(20, 0*, 22), (21, 1*, 22)}, EE3 = {(20, 59*, 22), (21, 90*, 22)}, EE4 = {(20, 21, 22), (21, 60*, 22)}. The outliers (*) in EE1 and EE2 cannot influence the classical/non-robust methods, whereas the outliers (*) in EE3 and EE4 can highly influence them. Similarly, suppose we corrupt the DE gene with outlying expressions as DE1 = {(20, 21, 22), (26, 70*, 28)}, DE2 = {(20, 1*, 22), (26, 27, 28)}, DE3 = {(20, 1*, 22), (26, 70*, 28)}, DE4 = {(20, 21, 22), (26, 8*, 28)}, DE5 = {(20, 38*, 22), (26, 27, 28)}, DE6 = {(20, 29*, 22), (26, 17*, 28)}. The assumed outliers (*) in DE1-DE3 cannot influence the classical/non-robust methods, whereas the outliers (*) in DE4-DE6 can highly influence them. The group variance/scale is 1 for the original EE and DE genes, whereas it lies between 1 and 1565 for the corrupted EE and DE genes. In the current real data analysis, we found that the proposed method detected 80 DE genes that were not detected by any of the other methods. Among these 80 DE genes, 35 genes were corrupted by outliers. Among these 35 outlying genes, we considered the 17 genes with extreme outlying expressions for functional analysis, as an example, to investigate their importance.
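The toy EE/DE genes above can be checked numerically. The following sketch is not part of the original analysis; it only recomputes the absolute mean difference between the two groups and the sample variance of each group for every toy gene, using the values listed in the text. The names `genes` and `summarize` are chosen here for illustration.

```python
from statistics import mean, variance

# Toy genes from the text: each gene is a pair of groups with three samples each.
genes = {
    "EE":  ((20, 21, 22), (21, 20, 22)),
    "DE":  ((20, 21, 22), (26, 27, 28)),
    "EE1": ((20, 59, 22), (21, 60, 22)),
    "EE2": ((20, 0, 22), (21, 1, 22)),
    "EE3": ((20, 59, 22), (21, 90, 22)),
    "EE4": ((20, 21, 22), (21, 60, 22)),
    "DE1": ((20, 21, 22), (26, 70, 28)),
    "DE2": ((20, 1, 22), (26, 27, 28)),
    "DE3": ((20, 1, 22), (26, 70, 28)),
    "DE4": ((20, 21, 22), (26, 8, 28)),
    "DE5": ((20, 38, 22), (26, 27, 28)),
    "DE6": ((20, 29, 22), (26, 17, 28)),
}

def summarize(groups):
    """Return (|mean difference|, variance of group 1, variance of group 2)."""
    g1, g2 = groups
    return abs(mean(g1) - mean(g2)), variance(g1), variance(g2)

summary = {name: summarize(groups) for name, groups in genes.items()}

for name, (diff, v1, v2) in summary.items():
    print(f"{name:4s} |mean diff| = {diff:6.2f}  variances = ({v1:8.2f}, {v2:8.2f})")
```

In EE1 and EE2 the mean difference stays below 1 because both groups are shifted together, whereas in EE3 and EE4 it exceeds 10. In DE4-DE6 the outliers shrink the true difference of 6 to nearly 0, which is how a classical test can miss a DE gene. The largest group variance, from the group (21, 90*, 22), is about 1564.3, consistent with the upper bound quoted in the text.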
The expressions of these 17 outlying genes are plotted above the lines with four different colors for the T, P, G and CTC groups, with the outlying samples indicated by red circles above them (Fig 3e). For each of the pairwise comparisons, only genes with significant 2-fold up- or down-regulation (adjusted p-values ≤ 0.05) were selected for all methods (Table 9). In this table, we observed a large number of DE genes detected by the proposed method. We searched for functional enrichment of specific pathways among these 17 DE genes using the WebGestalt gene analysis toolkit [30]. By mapping the differentially expressed gene set against the biological function annotations in the Gene Ontology database, we found significant enrichment of genes involved in the positive regulation of muscle-cell differentiation, the immune-response-regulating cell surface receptor signaling pathway, and the antigen-receptor-mediated signaling pathway, as well as genes involved in the carboxylic acid biosynthetic process (S2 File (xls)). Using the KEGG database, we found that several probes mapped to signaling pathways, including the beta-Alanine metabolism pathway, the MAPK signaling pathway and other metabolic pathways (Table 10 and S3 File (xls)). The pathway with the highest expression ratio in CTC was the p38 mitogen-activated protein kinase (p38 MAPK) signaling pathway, which is known to be involved in cancer cell migration. In the p38 MAPK pathway, PP2CB and MEF2C were significantly upregulated. The simulation study results in Table 7 also increase our confidence in the results of the proposed method for this real dataset because that simulated dataset was generated in accordance with the parametric properties of this real dataset.
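The selection rule used in this analysis (Bonferroni or Benjamini-Hochberg adjusted p-values ≤ 0.05 combined with an absolute fold change of 2) can be sketched as follows. This is a minimal illustration of the two standard adjustment procedures, not the authors' code. The p-values and fold changes are made-up numbers, and the sketch assumes fold changes are expressed on the log2 scale, so that a 2-fold change corresponds to |log2 FC| ≥ 1.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjustment, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_de(adj_pvals, log2_fc, alpha=0.05, min_abs_log2_fc=1.0):
    """Indices of genes passing both the adjusted p-value and fold-change filters."""
    return [
        i
        for i, (p, fc) in enumerate(zip(adj_pvals, log2_fc))
        if p <= alpha and abs(fc) >= min_abs_log2_fc
    ]

# Hypothetical raw p-values and log2 fold changes for five genes.
raw_p = [0.001, 0.008, 0.02, 0.041, 0.6]
log2_fc = [2.1, -1.4, 1.8, 0.3, 0.1]

de_bonferroni = select_de(bonferroni(raw_p), log2_fc)
de_bh = select_de(benjamini_hochberg(raw_p), log2_fc)
print("Bonferroni:", de_bonferroni)
print("BH:", de_bh)
```

On these made-up values the BH adjustment retains one more gene than Bonferroni, reflecting that Bonferroni is the more conservative of the two corrections.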

