Protein inter-domain linker prediction using Random Forest and amino acid physiochemical properties.

Shatnawi M, Zaki N, Yoo PD - BMC Bioinformatics (2014)

Bottom Line: Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. Without applying any data balancing technique such as class weighting and data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets. Our experimental results prove that the proposed approach is useful for domain-linker identification in highly imbalanced single- and multi-domain proteins.


ABSTRACT

Background: Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function predictions. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this study, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physiochemical and domain-linker information of amino acid sequences.

Results: The proposed approach was tested on two well-known benchmark protein datasets and achieved 68% sensitivity and 99% precision, which is better than any existing protein domain-linker predictor. Without applying any data balancing technique such as class weighting and data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets.

Conclusion: Our experimental results prove that the proposed approach is useful for domain-linker identification in highly imbalanced single- and multi-domain proteins.


Figure 4: Averaging window optimization. Recall, precision, and F1-score at different window sizes with fifty protein sequences from DomCut dataset.

Mentions: To find the optimal averaging window size, we tested odd window sizes in the range of 7 to 45 residues on 50 randomly selected protein sequences from the DS-All dataset [23] and on another 50 randomly selected protein sequences from the DomCut dataset [3], and then compared the prediction performance at these window sizes in terms of recall, precision, and F1-score. Figure 3 depicts the performance measures at different window sizes for the 50 protein sequences from the DS-All dataset. Figure 4 shows the same measures at different window sizes for the 50 protein sequences from the DomCut dataset. As seen in these two figures, a window size of 41 gave the highest recall, precision, and F1-score on both datasets. We therefore set the averaging window size to 41 to obtain the final experimental results.
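The window search described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the per-residue scores, the fixed `threshold`, and the function names `moving_average`, `precision_recall_f1`, and `best_window` are all assumptions introduced here, and the simple threshold stands in for the Random Forest classifier used in the paper.

```python
from typing import List, Sequence, Tuple


def moving_average(scores: Sequence[float], window: int) -> List[float]:
    """Centered moving average; the window is truncated at the sequence ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def precision_recall_f1(predicted: Sequence[bool],
                        actual: Sequence[bool]) -> Tuple[float, float, float]:
    """Per-residue precision, recall, and F1 for the linker (positive) class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_window(sequences, windows=range(7, 46, 2), threshold=0.5):
    """Return (window, F1) for the odd window size with the highest pooled F1.

    `sequences` is a list of (scores, labels) pairs, one per protein, where
    `scores` are hypothetical per-residue linker scores and `labels` are
    booleans marking true linker residues.
    """
    best = None
    for w in windows:
        predicted, actual = [], []
        for scores, labels in sequences:
            smoothed = moving_average(scores, w)
            # Assumption: a fixed threshold replaces the trained classifier.
            predicted.extend(s >= threshold for s in smoothed)
            actual.extend(labels)
        f1 = precision_recall_f1(predicted, actual)[2]
        if best is None or f1 > best[1]:
            best = (w, f1)
    return best
```

Calling `best_window` on a sample of proteins from each dataset separately would mirror the comparison shown in Figures 3 and 4. The sketch selects on F1 alone, whereas the text compares recall, precision, and F1-score together.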

