Journal of Bionic Engineering, Volume 5, Issue 3, pp 215–223

A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy

  • Shu-xue Zou
  • Yan-xin Huang
  • Yan Wang
  • Chun-guang Zho


Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. This paper presents a promising method for detecting the domain structure of a protein from sequence information alone. The method analyzes multiple sequence alignments derived from a database search. Several measures are defined to quantify the domain information content of each position along the sequence, and these are combined into a single predictor using a support vector machine (SVM). More importantly, domain detection is treated for the first time as an imbalanced data learning problem. A novel undersampling method based on distance-based maximal entropy in the SVM feature space is proposed. The overall precision is about 80%. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning systems that deal with general imbalanced datasets.
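The abstract does not state the selection rule used for undersampling, so the following is only a minimal sketch of the general idea of thinning the majority class using distances measured in a kernel-induced feature space. It relies on the standard identity ||φ(x) − φ(y)||² = k(x, x) − 2k(x, y) + k(y, y). The RBF kernel, the `gamma` value, the function names, and the greedy farthest-point heuristic are all assumptions made for illustration; the heuristic is a stand-in and is not the distance-based maximal entropy criterion of the paper.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2). gamma is an assumed value."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def feature_space_distance(x, y, kernel=rbf_kernel):
    """Distance between phi(x) and phi(y), computed through the kernel only:
    ||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)."""
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors


def undersample_majority(majority, n_keep, kernel=rbf_kernel):
    """Keep n_keep majority-class samples that are spread out in feature space.

    Hypothetical stand-in: greedy farthest-point selection, NOT the
    entropy-based criterion described in the abstract.
    """
    if n_keep >= len(majority):
        return list(majority)
    selected = [0]  # arbitrary starting sample
    # min_dist[i] = distance from sample i to its nearest selected sample
    min_dist = [feature_space_distance(majority[0], m, kernel) for m in majority]
    while len(selected) < n_keep:
        # pick the sample farthest from everything selected so far
        nxt = max(range(len(majority)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, m in enumerate(majority):
            d = feature_space_distance(majority[nxt], m, kernel)
            if d < min_dist[i]:
                min_dist[i] = d
    return [majority[i] for i in selected]
```

For example, calling `undersample_majority` on a handful of non-boundary positions with `n_keep=3` returns three samples that lie far apart under the kernel distance, which could then be paired with the minority (boundary) samples before SVM training.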


Keywords: protein domain boundary, SVM, imbalanced data learning, distance-based maximal entropy





Copyright information

© Jilin University 2008

Authors and Affiliations

  • Shu-xue Zou (1)
  • Yan-xin Huang (1)
  • Yan Wang (1)
  • Chun-guang Zho (1)
  1. Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, P. R. China
