A comparative study of improvements Pre-filter methods bring on feature selection using microarray data.

Wang Y, Fan X, Cai Y - Health Inf Sci Syst (2014)

Bottom Line: Pre-filter methods greatly reduced the number of features for both mRNA and microRNA expression datasets. The features selected after pre-filtering were biologically significant at levels such as biological process and microRNA function. Because it gives similar or better classification performance with fewer but biologically significant features, pre-filter-based feature selection should be considered when researchers need fast results on computationally demanding bioinformatics problems.


Affiliation: Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.

ABSTRACT

Background: Feature selection techniques have become a clear need in biomarker discovery with the development of microarray technology. However, the high dimensionality of microarray data makes feature selection time-consuming. To overcome this difficulty, filtering the data according to background knowledge before applying feature selection techniques has become a hot topic in microarray analysis. Different pre-filter methods may affect the final results greatly, so it is important to evaluate them in a systematic way.

Methods: In this paper, we compared the performance of statistical-based pre-filter methods, biological-based pre-filter methods, and their combination on microRNA-mRNA parallel expression profiles, using L1 logistic regression as the feature selection technique. Four types of data were built for both microRNA and mRNA expression profiles.
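As a rough illustration of the pipeline described in the Methods, the sketch below pre-filters features and then applies L1-penalized logistic regression, keeping the features with non-zero weights. It is not the authors' implementation: the variance threshold is only an assumed stand-in for a statistical pre-filter, the solver is a plain proximal-gradient loop, and the function names, toy data and parameter values are all hypothetical.

```python
import math

def soft_threshold(value, threshold):
    """Proximal step for the L1 penalty: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def variance_prefilter(X, min_variance):
    """Assumed statistical pre-filter: keep columns with variance >= min_variance."""
    n = len(X)
    keep = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        if variance >= min_variance:
            keep.append(j)
    return keep

def l1_logistic_select(X, y, penalty, step=0.1, iterations=500):
    """Fit L1-penalized logistic regression by proximal gradient descent
    and return the indices of features whose weight is non-zero."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * row[j] / n
        b -= step * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - step * gj, step * penalty)
             for wj, gj in zip(w, grad_w)]
    return [j for j, wj in enumerate(w) if wj != 0.0]

def select_features(X, y, min_variance=0.01, penalty=0.05):
    """Pre-filter first, then select with L1; indices refer to the raw columns."""
    kept = variance_prefilter(X, min_variance)
    X_filtered = [[row[j] for j in kept] for row in X]
    return [kept[j] for j in l1_logistic_select(X_filtered, y, penalty)]

# Toy data (hypothetical): column 0 separates the classes, column 1 is
# constant, column 2 carries no class information.
X = [[2.0, 1.0, 0.5], [2.0, 1.0, -0.5], [-2.0, 1.0, 0.5], [-2.0, 1.0, -0.5]]
y = [1, 1, 0, 0]
selected = select_features(X, y)
print(selected)
```

The pre-filter removes features before the comparatively expensive fitting loop runs, which is where the time saving reported in the Results would come from; a biological pre-filter would replace `variance_prefilter` with a lookup against annotated gene sets.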

Results: Pre-filter methods greatly reduced the number of features for both mRNA and microRNA expression datasets. The features selected after pre-filtering were shown to be significant at biological levels such as biological process and microRNA function. Analyses of classification performance based on precision showed that pre-filter methods were necessary when the number of raw features was much larger than the number of samples. Computing time was greatly shortened in all cases after pre-filtering.

Conclusions: Because it gives similar or better classification performance with fewer but biologically significant features, pre-filter-based feature selection should be considered when researchers need fast results on computationally demanding bioinformatics problems.



Figure 3: Overlaps among the selected features of the 10 mRNA datasets on 4 levels. The numbers in each panel are the numbers of features shared by the corresponding datasets. (a) Overlaps on GO-BP level; (b) Overlaps on GO-MF level; (c) Overlaps on GO-CC level; (d) Overlaps on GO-Pathway level.

Mentions: There were overlaps among the selected features of the 10 mRNA datasets (as shown in Figure 3(a-d)). For datasets constructed based on GO-BP, the numbers of shared genes were large. Only 8.3% and 7.45% of the genes in type 3-BP and type 4-BP, respectively, were covered by just one of the four datasets type 1, type 2, type 3-BP, and type 4-BP. Interestingly, the type 3-BP dataset kept only 60 selected genes; however, these genes were enriched in 67.19% of the GO BP terms enriched for HCM-related genes. These terms covered important biological processes related to HCM, such as adult heart development (GO:0007512), cardiac muscle tissue development (GO:0048738), muscle system process (GO:0003012), vasculature development (GO:0001944), and vasculogenesis (GO:0001570). Of these 60 genes, 55 were covered by the type 1, type 2, and type 4-BP datasets, as shown in Figure 3(a). Compared with GO-BP, datasets constructed based on GO-MF showed different results, especially type 4-MF, of which only 35.11% of the genes were in the overlaps among the four datasets type 1, type 2, type 3-MF, and type 4-MF (see Figure 3(b) for details). Nearly half of the selected genes (48.57% and 48.79%, respectively) appeared at least twice in type 3-CC and type 3-Pathway (see Figure 3(c-d)). In type 4-CC and type 4-Pathway, over 66% of the selected genes (66.95% and 69.29%, respectively) appeared at least twice (see Figure 3(c-d)).
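The overlap statistics quoted above (pairwise numbers of common features, and the share of a dataset's selected genes that appear at least twice) can be computed along the following lines. This is only a sketch under assumptions: the dataset contents are invented placeholder gene identifiers, not the paper's data, and "appeared at least twice" is read here as "selected in at least two of the compared datasets, including the dataset itself".

```python
from collections import Counter
from itertools import combinations

def pairwise_overlaps(datasets):
    """Number of common features for every pair of datasets."""
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(sorted(datasets), 2)
    }

def share_appearing_at_least(datasets, target, min_count=2):
    """Fraction of the target dataset's features that were selected in at
    least `min_count` of the datasets (the target itself counts as one)."""
    counts = Counter()
    for features in datasets.values():
        counts.update(features)
    selected = datasets[target]
    shared = [f for f in selected if counts[f] >= min_count]
    return len(shared) / len(selected)

# Hypothetical selected-feature sets; g1..g6 are placeholder gene names.
datasets = {
    "type 1": {"g1", "g2", "g3"},
    "type 2": {"g2", "g3", "g4"},
    "type 3-CC": {"g3", "g5"},
    "type 4-CC": {"g1", "g3", "g4", "g6"},
}

overlaps = pairwise_overlaps(datasets)
print(overlaps[("type 1", "type 2")])                    # genes shared by type 1 and type 2
print(share_appearing_at_least(datasets, "type 3-CC"))   # share of type 3-CC genes seen twice or more
print(share_appearing_at_least(datasets, "type 4-CC"))
```

Storing each dataset's selection as a set keeps both statistics to a few set operations, so the same helpers can be reused for the GO-BP, GO-MF, GO-CC and Pathway groupings.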

