Multi-view methods for protein structure comparison using latent dirichlet allocation.

Shivashankar S, Srivathsan S, Ravindran B, Tendulkar AV - Bioinformatics (2011)

Bottom Line: Retrieving protein structures similar to a given protein involves two major issues: (i) an effective protein structure representation that captures the inherent relationships between fragments and facilitates efficient comparison between structures and (ii) an effective framework to address different retrieval requirements. In this article, we propose an improved representation of protein structures using the latent Dirichlet allocation topic model. We compare the proposed representation and retrieval framework on the benchmark dataset developed by Kolodny and co-workers.


Affiliation: Department of Computer Science and Engineering, IIT Madras, Chennai-600 036.

ABSTRACT

Motivation: With rapidly expanding protein structure databases, efficiently retrieving structures similar to a given protein is an important problem. It involves two major issues: (i) an effective protein structure representation that captures the inherent relationships between fragments and facilitates efficient comparison between structures and (ii) an effective framework to address different retrieval requirements. Recently, researchers proposed a vector space model of proteins using a bag-of-fragments representation (FragBag), which corresponds to the basic information retrieval model.

Results: In this article, we propose an improved representation of protein structures using the latent Dirichlet allocation topic model. Another important requirement is to retrieve proteins whether they are close or remote homologs. In order to meet these diverse objectives, we propose a multi-viewpoint-based framework that combines multiple representations and retrieval techniques. We compare the proposed representation and retrieval framework on the benchmark dataset developed by Kolodny and co-workers. The results indicate that the proposed techniques outperform state-of-the-art methods.

Availability: http://www.cse.iitm.ac.in/~ashishvt/research/protein-lda/.

Contact: ashishvt@cse.iitm.ac.in.


Figure 5: Impact of weights, λ1 and λ2, in multi-viewpoint-based IR model. The models are constructed by combining LDA-based topic model with weight λ1 and term frequency (TF) vector space model with weight λ2.

Mentions: The influence of the weights λ1 and λ2 on retrieval performance is shown in Figure 5. For SAS thresholds of 2 and 3.5, the best performance is achieved with a higher weight for the LDA representation (λ1). For an SAS threshold of 5, on the other hand, the best performance is achieved when the weight for the simple vector space model (λ2) is higher than, or at least equal to, that of the LDA representation. As mentioned earlier, an SAS threshold of 2 corresponds to close homologs, and 5 denotes remote homologs. The LDA-based representation therefore performs better at identifying close homologs (SAS threshold of 2 Å) than remote ones (SAS threshold of 5 Å) for a given query protein. Since the fragment functionality overlap decreases as we move up the parent tree for a protein structure, exact matching with the naive vector space model performs better than the LDA representation at identifying remote homologs.

Motivated by the fact that the best results span different libraries, an ensemble on ranking is attempted. The output of each model (a combination of a library and λ1, λ2 values) is treated as an independent hypothesis; for example, for a given query protein structure, the similarity produced by a model using library 400 (11) and weights λ1=0.6, λ2=0.4 is one hypothesis. The best k hypotheses (k chosen empirically) are selected and combined using a bucket-of-models strategy. For example, let sim_X(q, s_d) and sim_Y(q, s_d) be the similarity scores between a query protein q and a protein s_d in the database, as provided by models X and Y, respectively. Using the bucket-of-models strategy, the similarity between q and s_d is given by

sim(q, s_d) = max(sim_X(q, s_d), sim_Y(q, s_d)).

This is referred to as the Combined Model. We tested it by combining the three best models across SAS thresholds: 600 (9) with weights (0.8, 0.2) for SAS=2; 400 (11) with weights (0.7, 0.3) for SAS=3.5; and 400 (11) with weights (0.4, 0.6) for SAS=5. Results for the Combined Model are given in Table 9.
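The weighting and bucket-of-models steps described above can be illustrated with a minimal Python sketch. Several details here are assumptions rather than statements from the article: the two views are combined as a weighted sum λ1·sim_LDA + λ2·sim_TF, cosine similarity is used within each view, and the vectors are toy values. The function names (cosine, multi_view_similarity, bucket_of_models) are hypothetical. Only the final max over model scores follows directly from the formula given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def multi_view_similarity(q_lda, d_lda, q_tf, d_tf, lam1, lam2):
    """Assumed combination rule: lam1 * sim_LDA + lam2 * sim_TF."""
    return lam1 * cosine(q_lda, d_lda) + lam2 * cosine(q_tf, d_tf)


def bucket_of_models(scores):
    """Combined Model score: the maximum over the individual model scores."""
    return max(scores)


# Toy vectors (not real data): topic proportions and fragment counts.
query = {"lda": [0.7, 0.2, 0.1], "tf": [3, 0, 1, 2]}
entry = {"lda": [0.6, 0.3, 0.1], "tf": [2, 1, 0, 2]}

# Weight pairs (lam1, lam2) quoted in the text for SAS = 2, 3.5 and 5.
# In the article each model also uses its own fragment library; this
# sketch reuses the same toy vectors for all three models.
weights = [(0.8, 0.2), (0.7, 0.3), (0.4, 0.6)]

model_scores = [
    multi_view_similarity(query["lda"], entry["lda"],
                          query["tf"], entry["tf"], lam1, lam2)
    for lam1, lam2 in weights
]
combined = bucket_of_models(model_scores)
print(model_scores, combined)
```

Taking the maximum means a database protein is ranked highly if any one of the selected models considers it similar to the query, which is one way to serve both close and remote homolog retrieval with a single ranking.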

