A systematic study on latent semantic analysis model parameters for mining biomedical literature

View Article: PubMed Central - HTML


In this study, empirical analyses were conducted using a previously published 50-gene data set to examine the effects of the following parameters (outlined in Figure 1): (i) stemming, stop-words and word counts (to discard abstracts without enough information), (ii) corpus content (e.g., abstracts with and without titles), (iii) inclusion or exclusion of the dc component, or 1st Eigen vector (which adds bias to the model), (iv) objective criteria to choose the number of factors (Eigen vectors) used to create the model, (v) information-theoretic criteria to select features (words in the corpus) instead of considering the complete set of features... The criteria listed for choosing the number of factors are: 1. the top 25 Eigen vectors; 2. [formula missing from excerpt], defined in terms of the energy content within p Eigen vectors and the energy content with n (all) Eigen vectors; 3. [formula missing from excerpt], where n is the number of documents, k indexes the Eigen vectors and S is the singular value. In addition, the effect of bias was studied by excluding the 1st Eigen vector (dc component)... The best model is defined as the one with relatively high average precision across a set of varied queries... Performance analysis (average precision-recall curves, F-measure, etc.) using Gene Ontology classifications corresponding to the 50-gene collection shows that not all parameters significantly affect the performance of the LSA model (Table 1)... In general, adding titles to the abstracts substantially increased the average precision... In addition, using the 0.7/n criterion produced better results than using 25 Eigen vectors or the 97% criterion... The best performance was achieved by combining three parameters: inclusion of titles with the abstracts in the corpus, exclusion of the dc component, and selection of Eigen vectors based on an objective criterion (Figure 2)... This work provides a framework for determining the best parameters when using LSA to rank genes with respect to queries... Future work will focus on evaluating this framework using different gene document collections.
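The excerpt names three ways of choosing the number of factors (a fixed 25, a 97% criterion, and a 0.7/n criterion) but does not preserve their formulas. The sketch below shows one plausible reading of each, together with the exclusion of the first factor. The function names, the definition of energy as squared singular values, and the interpretation of 0.7/n as a threshold on each singular value's share of the total are all assumptions made for illustration, not the authors' definitions.

```python
from typing import List, Sequence


def top_k(singular_values: Sequence[float], k: int = 25) -> int:
    """Fixed cutoff: keep at most the first k factors."""
    return min(k, len(singular_values))


def energy_cutoff(singular_values: Sequence[float], fraction: float = 0.97) -> int:
    """Smallest p whose cumulative energy reaches `fraction` of the total.

    ASSUMPTION: energy is the sum of squared singular values.
    """
    energies = [s * s for s in singular_values]
    total = sum(energies)
    running = 0.0
    for p, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return p
    return len(energies)


def relative_threshold(singular_values: Sequence[float], n_docs: int, c: float = 0.7) -> int:
    """Count factors whose share of the singular-value sum exceeds c / n_docs.

    ASSUMPTION: this is only one possible reading of the "0.7/n" criterion.
    """
    total = sum(singular_values)
    return sum(1 for s in singular_values if s / total > c / n_docs)


def factor_indices(n_factors: int, drop_first: bool) -> List[int]:
    """Zero-based indices of the retained factors.

    With drop_first=True the first factor (the dc component) is skipped.
    """
    start = 1 if drop_first else 0
    return list(range(start, n_factors))


# Hypothetical singular values, sorted in decreasing order.
example_values = [10.0, 5.0, 2.0, 1.0, 0.5]
kept = factor_indices(energy_cutoff(example_values), drop_first=True)
```

Whichever definition is used, the selected factor count and the dc-component switch are independent choices, which is why the study can evaluate them separately and in combination.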




Figure 2: Average precision vs. rank curve. A combination of three parameters was used: inclusion of titles, exclusion of the 1st Eigen vector, and the 0.7/n objective criterion for factor selection. This combination performs better than any of the individual parameters alone.

Mentions: Performance analysis (average precision-recall curves, F-measure, etc.) using Gene Ontology classifications corresponding to the 50-gene collection shows that not all parameters significantly affect the performance of the LSA model (Table 1). In general, adding titles to the abstracts substantially increased the average precision. In addition, using the 0.7/n criterion produced better results than using 25 Eigen vectors or the 97% criterion. The best performance was achieved by combining three parameters: inclusion of titles with the abstracts in the corpus, exclusion of the dc component, and selection of Eigen vectors based on an objective criterion (Figure 2). This work provides a framework for determining the best parameters when using LSA to rank genes with respect to queries. Future work will focus on evaluating this framework using different gene document collections.
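Average precision is the headline measure in the passage above. As a reminder of what it captures, here is a minimal sketch of the standard non-interpolated average precision for one ranked list. The page does not show the authors' exact evaluation protocol, so this is an assumption, and the gene identifiers are hypothetical placeholders.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values taken at the rank of each relevant item.

    Relevant items that never appear in `ranked` contribute zero,
    because the sum is divided by the total number of relevant items.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical ranking of four genes for one query; two are relevant.
ranking = ["geneA", "geneB", "geneC", "geneD"]
relevant_genes = {"geneA", "geneC"}
score = average_precision(ranking, relevant_genes)  # (1/1 + 2/3) / 2
```

Averaging this score over a set of varied queries gives a single number per model, which matches the excerpt's description of the best model as the one with relatively high average precision across queries.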

