Enriched Bag of Words for Protein Remote Homology Detection

  • Andrea Cucci
  • Pietro Lovato
  • Manuele Bicego
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10029)


One of the most challenging Pattern Recognition problems in Bioinformatics is detecting whether two proteins that show very low sequence similarity are functionally or structurally related; this is the so-called Protein Remote Homology Detection (PRHD) problem. Although approaches based on the “Bag of Words” (BoW) paradigm have shown high potential in this context, there is still room for refinement, especially by taking the specific application context into account. In this paper we propose a modified BoW representation for PRHD, which enriches the classic BoW with information derived from the evolutionary history of the mutations each protein has undergone. An experimental comparison on a standard benchmark demonstrates the feasibility of the proposed technique.
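For readers unfamiliar with the classic BoW representation mentioned above, the following is a minimal sketch of how a protein sequence can be turned into a vector of N-gram counts. It illustrates only the plain, unenriched BoW; the evolutionary enrichment proposed in the paper is not reproduced here. The function name `ngram_bow`, the 20-letter alphabet constant, and the choice of overlapping N-grams are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_bow(sequence, n=2):
    """Return (counts, vocabulary) for overlapping n-grams of a sequence.

    Hypothetical helper: the vocabulary is every possible n-gram over
    AMINO_ACIDS, and counts[i] is how often vocabulary[i] occurs.
    N-grams containing non-standard letters are simply ignored.
    """
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    observed = Counter(
        sequence[i:i + n] for i in range(len(sequence) - n + 1)
    )
    counts = [observed[word] for word in vocabulary]
    return counts, vocabulary


if __name__ == "__main__":
    counts, vocabulary = ngram_bow("ACAC", n=2)
    # Show only the n-grams that actually occur in the toy sequence.
    print({w: c for w, c in zip(vocabulary, counts) if c > 0})
```

Each sequence is mapped to a fixed-length vector (400 entries for n = 2), so sequences of different lengths become directly comparable by a standard classifier.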


Keywords: Bag of words, N-grams, Sequence classification



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Dipartimento di Informatica - Ca’ Vignal 2, Università degli Studi di Verona, Verona, Italy
