Abstract
Biological sequence databases typically have a small alphabet, very long sequences and a relatively small size (several hundred sequences). With these characteristics in mind, we propose a new sequence mining algorithm, gIL. gIL was developed for linear sequence pattern mining and combines some of the most efficient techniques used in sequence and itemset mining. The algorithm is highly adaptable, allowing various types of features to be introduced smoothly and directly into the mining process, namely the extraction of rigid and arbitrary gap patterns. Both breadth-first and depth-first traversals are possible. The experimental evaluation, on synthetic and real-life protein databases, shows that our algorithm outperforms state-of-the-art algorithms. The use of constraints also proved to be a very useful tool for specifying the patterns that interest the user.
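To make the two pattern types concrete, the sketch below counts the support of a rigid gap pattern (a fixed number of wildcard positions) and an arbitrary gap pattern (a variable-length gap between symbols) over a toy set of protein-like strings. It is only an illustration of the pattern semantics as we read them from the abstract, not the gIL algorithm: it scans with regular expressions rather than mining, and the helper names, the `.` wildcard notation, the gap bounds and the example sequences are all assumptions made for this example.

```python
import re

def rigid_gap_regex(pattern):
    """Compile a rigid gap pattern; '.' stands for exactly one arbitrary symbol.

    The '.' notation is an assumption of this sketch, not the paper's syntax.
    """
    return re.compile("".join("." if c == "." else re.escape(c) for c in pattern))

def arbitrary_gap_regex(symbols, min_gap, max_gap):
    """Compile a pattern whose consecutive symbols are separated by
    between min_gap and max_gap arbitrary symbols."""
    gap = ".{%d,%d}" % (min_gap, max_gap)
    return re.compile(gap.join(re.escape(s) for s in symbols))

def support(regex, sequences):
    """Number of sequences that contain at least one occurrence of the pattern."""
    return sum(1 for s in sequences if regex.search(s))

# Hypothetical toy database: small alphabet, few sequences.
sequences = ["MKTAYIAK", "MKSAYLAK", "GGMKTAWK", "AYAYAY"]

rigid = rigid_gap_regex("MK.AY")                       # M K <any one symbol> A Y
flexible = arbitrary_gap_regex(["M", "A", "K"], 0, 4)  # M, gap 0..4, A, gap 0..4, K

rigid_support = support(rigid, sequences)
flexible_support = support(flexible, sequences)
print(rigid_support, flexible_support)
```

A constraint such as a minimum support threshold would then simply keep the patterns whose support reaches the chosen value; how gIL enforces such constraints during mining is described in the paper itself, not in this sketch.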
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferreira, P.G., Azevedo, P.J. (2005). Protein Sequence Pattern Mining with Constraints. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds) Knowledge Discovery in Databases: PKDD 2005. Lecture Notes in Computer Science, vol 3721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564126_14
DOI: https://doi.org/10.1007/11564126_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29244-9
Online ISBN: 978-3-540-31665-7
eBook Packages: Computer Science, Computer Science (R0)