Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics.

Ragothaman A, Boddu SC, Kim N, Feinstein W, Brylinski M, Jha S, Kim J - Biomed Res Int (2014)

Bottom Line: To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that uses computational resources efficiently and effectively. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.


Affiliation: RADICAL, ECE, Rutgers University, New Brunswick, NJ 08901, USA.

ABSTRACT
While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because the predicted structural information can uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences in even a small genome such as a prokaryote's is large, typically many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that uses computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data- and task-level parallelism and manages large variations in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based on task requirements. We present a runtime analysis characterizing the computational complexity of eThread and of the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly for small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.
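
As a rough illustration of the task-level parallelism described above, the following minimal Python sketch (hypothetical; the actual pipeline is built on the SAGA-Pilot abstraction, and the file and tool names below are placeholders) treats each (tool, sequence) pair as an independent subtask and packs the pairs onto a pool of cores.

    from concurrent.futures import ProcessPoolExecutor
    import subprocess

    SEQUENCES = [f"seq_{i:03d}.fasta" for i in range(20)]  # placeholder inputs
    TOOLS = ["pfscan"]  # placeholder for one pfTools component; eThread bundles several tools

    def run_subtask(tool: str, fasta: str) -> str:
        """Run one threading subtask; (tool, sequence) pairs are independent,
        so a pilot-style scheduler may place them on any available core."""
        subprocess.run([tool, fasta], check=True)  # placeholder invocation
        return f"{tool}:{fasta}"

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=16) as pool:  # e.g. one 16-core instance
            futures = [pool.submit(run_subtask, t, s)
                       for t in TOOLS for s in SEQUENCES]
            results = [f.result() for f in futures]

Because the subtasks are independent, the same pattern extends from one multicore VM to a pool of heterogeneous VMs, which is what the pilot abstraction provides.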


fig6: Time-to-solution of each elementary step in the pipeline using two heterogeneous VMs (a); single-VM results are shown for comparison (b). Results were obtained with 20 sequences using pfTools.

Mentions: The most significant factor behind this discrepancy lies in the current implementation of concurrent subtask execution. With the efficient job monitoring provided by SAGA-Pilot, all available cores in multicore instances are guaranteed to be utilized and thus contribute to the speed-up, but resources are still used inefficiently. For example, when the subtasks corresponding to 20 sequences are distributed across 16 cores, some cores are very likely to finish their assigned tasks early and then sit idle until the last task completes on another core. This is reflected in the fact that the gap from the ideal limit is smaller for the single-core m1.small instance than for the dual-core c1.large, and larger still for the 16-core hi1.4xlarge. This strongly suggests that a better strategy is needed to keep all cores fully utilized over the time-to-solution. Less computationally demanding tools are more affected by overheads such as VM launch, but their expected share of the total is minimal; hence, to optimize the entire eThread pipeline, the key strategy should combine efficient parallelization of the input data with speed-up of THREADER and the other "high"-computation tools. As we demonstrated, when mixed instance types are used (see Figure 6), more complicated underlying mechanisms must be considered, arising from differences in CPU performance, core count, and other factors such as memory. Finally, several characteristics of EC2 are difficult to interpret with respect to performance. For example, we observed that the performance of t1.micro is hard to predict, as can be glimpsed from the two experiments presented in Figure 5 and Table 4: the two data sets clearly show that t1.micro behaves very differently from the other instances, in particular in the relative ratio between chain and domain. In many cases t1.micro delivered erratic performance; based on Amazon's own description of this instance type, we suspect it is configured for low, bursty throughput at low cost and is not appropriate for computations requiring consistent performance.
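
To make the idle-core effect concrete, this minimal Python sketch (illustrative only, not part of the pipeline; equal task durations are assumed) computes the utilization bound for the 20-subtask case discussed above.

    import math

    def utilization(n_tasks: int, n_cores: int) -> float:
        """Busy core-time over total core-time when every core must wait
        for the last wave of tasks to finish (equal durations assumed)."""
        waves = math.ceil(n_tasks / n_cores)  # time-to-solution in task units
        return n_tasks / (n_cores * waves)

    # 20 subtasks, as in the experiment above:
    for cores in (1, 2, 16):  # m1.small-, c1.large-, hi1.4xlarge-like core counts
        print(f"{cores:2d} cores: {utilization(20, cores):.0%} utilization")
    # -> 1 core: 100%, 2 cores: 100%, 16 cores: 62% (20 tasks in 2 waves over 32 core-slots)

Even under perfect scheduling, the gap from the ideal limit therefore grows with core count, consistent with the m1.small versus hi1.4xlarge observation.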

