Decision Tree Classifier for Classification of Proteins Using the Protein Data Bank

  • Babasaheb S. Satpute
  • Raghav Yadav
Part of the Studies in Computational Intelligence book series (SCI, volume 771)


Identifying the family of an unknown protein is a challenging problem in computational biology and bioinformatics. Our aim is to classify proteins into families and to identify the family of an unknown protein, using the surface roughness of the protein as the criterion. The Protein Data Bank (PDB) is the repository for protein data and contains the Cartesian coordinates of the amino acid residues that form each protein. However, PDB coordinates give no indication of the orientation of the protein, which must be known in order to determine the surface roughness. We therefore designed an invariant coordinate system (ICS) whose origin is the protein's center of gravity (CG). The PDB provides the coordinates of all the amino acid residues in the protein, but only the surface coordinates are needed to determine surface similarity. We therefore developed a methodology to identify the surface residues and recorded their coordinates. We then divided those coordinates into eight octants based on the signs of the x, y and z coordinates. For the residues in each octant, we computed the standard deviation of the coordinates and defined a parameter called the surface-invariant coordinate (SIC). Thus, for every protein, we obtained eight SIC values.
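The octant-based computation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the function names are hypothetical, all residues are given equal mass when computing the CG, the surface residues are assumed to have been identified already, the axes are left in their original PDB orientation, and "standard deviation of the coordinates" is interpreted here as the population standard deviation of each surface residue's distance from the CG.

```python
import math
from statistics import pstdev


def center_of_gravity(coords):
    """Unweighted centroid of (x, y, z) points (assumption: equal residue masses)."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def octant_index(point):
    """Map a point to an octant 0..7 from the signs of x, y and z.

    A coordinate of exactly zero is treated as positive (an arbitrary choice).
    """
    x, y, z = point
    return (x < 0) * 4 + (y < 0) * 2 + (z < 0) * 1


def sic_values(all_coords, surface_coords):
    """Return eight values, one per octant, for a single protein.

    all_coords: coordinates of every residue, used only to locate the CG.
    surface_coords: coordinates of residues already identified as surface residues.
    Assumption: each value is the standard deviation of the distances from
    the CG to the surface residues in that octant; empty octants yield 0.0.
    """
    cg = center_of_gravity(all_coords)
    distances = [[] for _ in range(8)]
    for p in surface_coords:
        shifted = tuple(p[i] - cg[i] for i in range(3))  # move origin to the CG
        radius = math.sqrt(sum(c * c for c in shifted))
        distances[octant_index(shifted)].append(radius)
    return [pstdev(d) if d else 0.0 for d in distances]


if __name__ == "__main__":
    # Toy data: two nested cubes centred on the origin, all points on the "surface".
    signs = [(sx, sy, sz) for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
    points = [(s * sx, s * sy, s * sz) for s in (1.0, 2.0) for sx, sy, sz in signs]
    print(sic_values(points, points))
```

The resulting eight-element vector per protein is the kind of fixed-length feature set that a decision tree classifier could take as input; how the authors train and evaluate that classifier is not described in this abstract.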


Protein classification; Structural classification of proteins (SCOP); Protein data bank (PDB); Surface-invariant coordinate (SIC); Decision tree classifier



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Department of Computer Science & IT, SIET, SHUATS, Allahabad, India
