A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy
Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. This paper presents a method for detecting the domain structure of a protein from sequence information alone. The method analyzes multiple sequence alignments derived from a database search. Several measures are defined to quantify the domain information content of each position along the sequence, and these are combined into a single predictor using a support vector machine (SVM). More importantly, domain detection is first treated as an imbalanced data learning problem, and a novel undersampling method based on distance-based maximal entropy in the SVM feature space is proposed. The overall precision is about 80%. Simulation results indicate that the method can help both in predicting the complete 3D structure of a protein and in machine learning on general imbalanced datasets.
Keywords: protein domain boundary, SVM, imbalanced data learning, distance-based maximal entropy
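The abstract does not say how the distance-based maximal entropy criterion is computed, so the following is only a rough sketch of the general idea of undersampling the majority class using distances measured in the SVM feature space. It relies on the standard kernel identity ||φ(x) − φ(y)||² = K(x, x) − 2K(x, y) + K(y, y). Greedy farthest-point selection is used here as a stand-in for the entropy criterion; it is not the authors' method. The function names (`rbf_kernel`, `feature_space_distance`, `undersample_majority`) and the `gamma` value are hypothetical choices made for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2). gamma is an assumed value."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def feature_space_distance(x, y, kernel=rbf_kernel):
    """Distance between phi(x) and phi(y), computed with the kernel trick."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def undersample_majority(majority, n_keep, kernel=rbf_kernel):
    """Return indices of n_keep majority samples spread out in feature space.

    Stand-in heuristic (NOT the paper's entropy criterion): start from the
    first sample, then repeatedly add the sample whose distance to the
    already selected set is largest.
    """
    if n_keep <= 0:
        return []
    if n_keep >= len(majority):
        return list(range(len(majority)))
    selected = [0]
    # min_dist[i] = distance from sample i to its nearest selected sample
    min_dist = [feature_space_distance(majority[0], m, kernel) for m in majority]
    while len(selected) < n_keep:
        nxt = max(range(len(majority)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, m in enumerate(majority):
            d = feature_space_distance(majority[nxt], m, kernel)
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


if __name__ == "__main__":
    # Toy majority-class points: two near-duplicate pairs and one outlier.
    majority = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
    kept = undersample_majority(majority, n_keep=3)
    print([majority[i] for i in kept])
```

In a setting like the one described above, the retained majority samples would be combined with all minority samples (for example, domain-boundary positions) before training the SVM. Whether such a spread-out subset approximates a maximal-entropy selection is an assumption of this sketch.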