Enriched Bag of Words for Protein Remote Homology Detection
One of the most challenging Pattern Recognition problems in Bioinformatics is detecting whether two proteins that show very low sequence similarity are functionally or structurally related; this is the so-called Protein Remote Homology Detection (PRHD) problem. Although approaches based on the “Bag of Words” (BoW) paradigm have shown high potential in this context, there is still room for refinement, especially by taking the peculiarities of the application context into account. In this paper we propose a modified BoW representation for PRHD, which enriches the classic BoW with information derived from the evolutionary history of mutations to which each protein has been subjected. An experimental comparison on a standard benchmark demonstrates the feasibility of the proposed technique.
Keywords: Bag of words, N-grams, Sequence classification
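As a rough illustration of the classic BoW representation that the abstract takes as its starting point, the sketch below counts overlapping N-grams in a protein sequence. It does not implement the enriched, evolution-aware variant proposed in the paper, whose details are not given here. The function name `ngram_bow`, the 20-letter amino acid alphabet, and the choice of N = 2 are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Assumption: the standard 20-letter amino acid alphabet serves as the base symbols.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_bow(sequence, n=2):
    """Return a fixed-length count vector of overlapping n-grams (hypothetical helper).

    The vocabulary is every possible n-gram over AMINO_ACIDS, in lexicographic
    order of the alphabet string, so the vector has 20**n entries.
    """
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    # Slide a window of width n over the sequence and count each "word".
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # N-grams containing symbols outside the alphabet are simply ignored here.
    return [counts.get(word, 0) for word in vocabulary]


if __name__ == "__main__":
    vector = ngram_bow("ACAC", n=2)
    print(len(vector), sum(vector))
```

Two sequences represented this way can be compared with any vector-based classifier or kernel, regardless of their lengths; the paper's contribution lies in how this basic count vector is enriched.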