Protein Sequence Pattern Mining with Constraints

Ferreira, Pedro Gabriel; Azevedo, Paulo J.

doi:10.1007/11564126_14


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3721)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery


Abstract

Biological sequence databases typically have a small alphabet, very long sequences and a relatively small size (several hundred sequences). Considering these characteristics, we propose a new sequence mining algorithm (gIL). gIL was developed for linear sequence pattern mining and combines some of the most efficient techniques used in sequence and itemset mining. The algorithm is highly adaptable, allowing a smooth and direct introduction of various types of features into the mining process, namely the extraction of rigid and arbitrary gap patterns. Both breadth-first and depth-first traversals are possible. The experimental evaluation, on synthetic and real-life protein databases, has shown that our algorithm outperforms state-of-the-art algorithms. The use of constraints has also proved to be a very useful tool for specifying the patterns that interest the user.
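The distinction between rigid and arbitrary gap patterns can be illustrated with a minimal Python sketch. This is not the gIL algorithm, whose internals are not described in this abstract; it only counts the support of a given gapped pattern by translating it to a regular expression. The pattern encoding, the function names (`to_regex`, `support`, `is_frequent`) and the toy database are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical encoding: a pattern is a list of (symbol, min_gap, max_gap).
# The gap bounds give the number of arbitrary symbols allowed between this
# symbol and the previous one; they are ignored for the first element.
# min_gap == max_gap -> rigid gap; min_gap < max_gap -> arbitrary (flexible) gap.
Element = Tuple[str, int, int]


def to_regex(pattern: List[Element]) -> "re.Pattern[str]":
    """Translate a gapped pattern into a compiled regular expression."""
    parts = []
    for i, (symbol, lo, hi) in enumerate(pattern):
        if i > 0:
            parts.append(".{%d,%d}" % (lo, hi))  # gap constraint as a bounded wildcard
        parts.append(re.escape(symbol))
    return re.compile("".join(parts))


def support(pattern: List[Element], database: List[str]) -> int:
    """Number of sequences containing at least one occurrence of the pattern."""
    rx = to_regex(pattern)
    return sum(1 for seq in database if rx.search(seq))


def is_frequent(pattern: List[Element], database: List[str], min_support: int) -> bool:
    """Minimum-support constraint: keep only patterns seen in enough sequences."""
    return support(pattern, database) >= min_support


# Toy database (made up for illustration, not real protein data).
db = ["MKALVC", "AKKC", "ACGT", "GGGG"]

rigid = [("A", 0, 0), ("C", 2, 2)]     # A, exactly two symbols, then C
flexible = [("A", 0, 0), ("C", 0, 2)]  # A, zero to two symbols, then C

print(support(rigid, db))     # 2
print(support(flexible, db))  # 3
```

A real miner would enumerate candidate patterns and prune them with such constraints during the search rather than test one pattern at a time; the sketch only shows what the gap and support constraints mean.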



Author information

Authors and Affiliations

Department of Informatics, University of Minho, Campus of Gualtar, 4710-057, Braga, Portugal
Pedro Gabriel Ferreira & Paulo J. Azevedo


Editor information

Editors and Affiliations

LIACC/FEP, Universidade do Porto, Portugal
Alípio Mário Jorge
LIAAD-INESC Porto LA / FEP, University of Porto, R. de Ceuta, 118, 6, 4050-190, Porto, Portugal
Luís Torgo
LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Rua de Ceuta, 118-6, 4050-190, Porto, Portugal
Pavel Brazdil
Faculdade de Engenharia & LIAAD, Universidade do Porto, Portugal
Rui Camacho
Faculty of Economics of the University of Porto, Portugal
João Gama


About this paper

Cite this paper

Ferreira, P.G., Azevedo, P.J. (2005). Protein Sequence Pattern Mining with Constraints. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds) Knowledge Discovery in Databases: PKDD 2005. Lecture Notes in Computer Science, vol 3721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564126_14


DOI: https://doi.org/10.1007/11564126_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29244-9
Online ISBN: 978-3-540-31665-7
eBook Packages: Computer Science, Computer Science (R0)
