Community Detection-Based Feature Construction for Protein Sequence Classification

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2015)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9096)

Abstract

Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional, informative feature vectors can be crucial to the performance of these algorithms. In prior work, we proposed a community detection approach for constructing low-dimensional feature sets for nucleotide sequence classification. That approach uses the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and then applies community detection to identify groups of k-mers that appear frequently in a set of sequences. While it worked well for nucleotide sequence classification, it could not be applied directly to protein sequences, because the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extend our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
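The network construction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the score table `TOY_SCORES`, the `match` and `default` values, both thresholds, and the two example sequences are hypothetical values chosen for the example, and the subsequent community detection step is not shown.

```python
from itertools import combinations


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length k-mers differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy scores, NOT taken from any published substitution matrix.
TOY_SCORES = {("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1, ("K", "R"): 2}


def pair_score(x, y, match=4, default=-1):
    """Score one aligned residue pair using the toy table (symmetric lookup)."""
    if x == y:
        return match
    return TOY_SCORES.get((x, y), TOY_SCORES.get((y, x), default))


def substitution_score(a, b):
    """Sum of per-position scores for two equal-length k-mers."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


def build_network(nodes, linked):
    """Adjacency sets: connect every pair of k-mers for which linked(a, b) holds."""
    graph = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if linked(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def edge_count(graph):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


# Two made-up protein fragments and their 3-mers.
nodes = sorted(set(kmers("MKILV", 3)) | set(kmers("MRVLI", 3)))

# Hamming-based network (assumed threshold: at most 1 mismatch).
hamming_net = build_network(nodes, lambda a, b: hamming(a, b) <= 1)

# Score-based network (assumed threshold: total score of at least 8).
score_net = build_network(nodes, lambda a, b: substitution_score(a, b) >= 8)
```

With these toy values, the Hamming-based network has no edges, while the score-based network links k-mers that differ only by substitutions the toy table treats as similar (for example `MKI` and `MRV`). A community detection algorithm would then be run on such a network to group k-mers into features.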


Author information

Corresponding author

Correspondence to Karthik Tangirala.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Tangirala, K., Herndon, N., Caragea, D. (2015). Community Detection-Based Feature Construction for Protein Sequence Classification. In: Harrison, R., Li, Y., Măndoiu, I. (eds) Bioinformatics Research and Applications. ISBRA 2015. Lecture Notes in Computer Science, vol 9096. Springer, Cham. https://doi.org/10.1007/978-3-319-19048-8_28

  • DOI: https://doi.org/10.1007/978-3-319-19048-8_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19047-1

  • Online ISBN: 978-3-319-19048-8

  • eBook Packages: Computer Science, Computer Science (R0)
