Identification of N-Glycosylation Sites with Sequence and Structural Features Employing Random Forests

  • Shreyas Karnik
  • Joydeep Mitra
  • Arunima Singh
  • B. D. Kulkarni
  • V. Sundarajan
  • V. K. Jayaraman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5909)


N-glycosylation plays an important role in processes such as quality control of proteins produced in the endoplasmic reticulum (ER), protein transport, and disease control. Experimental elucidation of N-glycosylation sites is expensive and laborious. In this work we build models for identifying potential N-glycosylation sites in proteins based on sequence and structural features. The best model has a cross-validation accuracy of 72.81%.
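As a rough illustration of the sequence side of such a model, the sketch below locates candidate sites and extracts the residues around them. It assumes the standard consensus sequon Asn-X-Ser/Thr (X not Pro) as the definition of a candidate site. The function names, the window half-width, and the padding character are hypothetical choices made for this example, and the abstract does not state the actual features or window size used in the paper.

```python
import re

# Assumption: candidate sites follow the consensus sequon N-X-S/T with X != Pro.
# The lookahead keeps overlapping sequons (e.g. "NNST") from being skipped.
SEQUON = re.compile(r"N(?=[^P][ST])")


def candidate_sites(sequence):
    """Return 0-based positions of Asn residues that start an N-X-S/T sequon."""
    return [match.start() for match in SEQUON.finditer(sequence.upper())]


def flanking_window(sequence, position, half_width=5, pad="-"):
    """Return the residues centred on `position`, padded at the sequence ends.

    half_width and pad are illustrative defaults, not values from the paper.
    """
    padded = pad * half_width + sequence.upper() + pad * half_width
    # After padding, the residue at `position` sits at position + half_width,
    # so the window of length 2 * half_width + 1 starts at `position`.
    return padded[position : position + 2 * half_width + 1]


if __name__ == "__main__":
    protein = "MNGTANPSQNNSTK"  # made-up sequence for demonstration only
    for site in candidate_sites(protein):
        print(site, flanking_window(protein, site, half_width=2))
```

Each window could then be encoded (for example with amino acid property indices) and passed, together with structural descriptors, to a classifier such as a random forest. That encoding step is left out here because the abstract does not describe it.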


Support Vector Machine; Random Forest; Glycosylation Site; Amino Acid Property; Contact Order



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Shreyas Karnik (1, 3)
  • Joydeep Mitra (1)
  • Arunima Singh (1)
  • B. D. Kulkarni (1)
  • V. Sundarajan (2)
  • V. K. Jayaraman (2)

  1. Chemical Engineering and Process Development Division, National Chemical Laboratory, Pune, India
  2. Center for Development of Advanced Computing, Pune University Campus, Pune, India
  3. School of Informatics, Indiana University, Indianapolis, USA
