Abstract
Approximate searching on the primary structure (i.e., the amino acid arrangement) of protein sequences is essential to predicting the functions and evolutionary histories of proteins. However, because evolutionarily distant proteins do not conserve their amino acid residue arrangements, approximate searching on the secondary structure of proteins is important for detecting distant homology. In this paper, we propose an indexing scheme for efficient approximate searching on the secondary structure of protein sequences that can be easily implemented in an RDBMS. Exploiting the concepts of clustering and lookahead, the proposed indexing scheme quickly processes three types of secondary structure queries: exact match, range match, and wildcard match. To evaluate the performance of the proposed method, we conducted extensive experiments using a set of actual protein sequences. In these experiments, the proposed method was up to 6.3 times faster than existing indexing methods for exact match queries, up to 3.3 times faster for range match queries, and up to 1.5 times faster for wildcard match queries.
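To make the three query types concrete, the following is a minimal sketch and not the CSI index itself. The abstract does not define the query format, so everything here is an assumption made for illustration: a secondary structure is taken to be a string over the symbols H (helix), E (strand), and L (loop); it is run-length encoded into segments; and a query is a list of hypothetical `(symbol, min_len, max_len)` patterns, where `min_len == max_len` stands for an exact match, `min_len < max_len` for a range match, and the symbol `?` for a wildcard. The function names `to_segments`, `segment_matches`, and `search` are likewise hypothetical. The search is a plain sequential scan over segments, which is the kind of baseline an index would aim to outperform.

```python
from itertools import groupby


def to_segments(structure):
    """Run-length encode a structure string, e.g. 'HHHEEL' -> [('H', 3), ('E', 2), ('L', 1)]."""
    return [(symbol, len(list(run))) for symbol, run in groupby(structure)]


def segment_matches(segment, pattern):
    """Check one segment against one (symbol, min_len, max_len) pattern.

    Assumed conventions: '?' matches any symbol; min_len == max_len is an
    exact-length match; min_len < max_len is a range match.
    """
    symbol, length = segment
    wanted, min_len, max_len = pattern
    return (wanted == "?" or wanted == symbol) and min_len <= length <= max_len


def search(structure, query):
    """Return the segment positions where the query matches consecutive segments.

    This is a naive sequential scan, not an indexed lookup.
    """
    segments = to_segments(structure)
    hits = []
    for start in range(len(segments) - len(query) + 1):
        window = segments[start:start + len(query)]
        if all(segment_matches(s, p) for s, p in zip(window, query)):
            hits.append(start)
    return hits


structure = "HHHHEEELLHHHHHEE"  # segments: H4, E3, L2, H5, E2

exact_hits = search(structure, [("H", 4, 4), ("E", 3, 3)])
range_hits = search(structure, [("H", 3, 5), ("E", 2, 3)])
wildcard_hits = search(structure, [("E", 1, 9), ("?", 1, 9), ("H", 5, 5)])
print(exact_hits, range_hits, wildcard_hits)
```

Under these assumptions, the exact query matches only at the first segment, the range query also matches the later helix and strand pair, and the wildcard query accepts any segment type between the strand and the five-residue helix.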
This work was supported by a Korea Research Foundation Grant (KRF-2004-003-D00302).
© 2005 Springer-Verlag Berlin Heidelberg
Seo, M., Park, S., Won, JI. (2005). CSI: Clustered Segment Indexing for Efficient Approximate Searching on the Secondary Structure of Protein Sequences. In: Hacid, MS., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds) Foundations of Intelligent Systems. ISMIS 2005. Lecture Notes in Computer Science, vol 3488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11425274_25
DOI: https://doi.org/10.1007/11425274_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25878-0
Online ISBN: 978-3-540-31949-8
eBook Packages: Computer Science, Computer Science (R0)