Incremental Maintenance of Biological Databases Using Association Rule Mining

  • Kai-Tak Lam
  • Judice L. Y. Koh
  • Bharadwaj Veeravalli
  • Vladimir Brusic
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4146)


Biological research frequently requires specialist databases to support in-depth analysis of specific subjects. With the rapid growth of biological sequences in public domain data sources, it is difficult to keep these databases current with the sources. Simple queries formulated to retrieve relevant sequences typically return a large number of false matches and thus demand manual filtering. In this paper, we propose a novel methodology that can support automatic incremental updating of specialist databases. Complex queries for incremental updating of relevant sequences are learned using Association Rule Mining (ARM), resulting in a significant reduction in false positive matches. This is the first time ARM has been used to formulate descriptive queries for the incremental maintenance of specialised biological databases. We have implemented and tested our methodology on two real-world databases. Our experiments show that the methodology achieves an F-score of up to 80% in detecting new sequences for these two databases.
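The abstract does not describe how records are turned into items, which rules are selected, or which thresholds are used. As a rough illustration only, the following Python sketch mines frequent term sets from the annotations of records already in a specialist database (a plain Apriori-style search), turns the largest frequent set into a conjunctive "complex query", and compares its F-score with that of a single-term query. The toy records, the accession labels `A1` to `A4`, the 0.75 support threshold, and the helper names `frequent_itemsets` and `f_score` are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def frequent_itemsets(records, min_support):
    """Return {itemset: support} for term sets present in >= min_support of records."""
    n = len(records)
    # Start from single-term candidates.
    current = {frozenset([term]) for record in records for term in record}
    result = {}
    k = 1
    while current:
        # Count how many records contain each candidate itemset.
        counts = {c: sum(1 for r in records if c <= r) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(level)
        # Apriori step: join frequent k-itemsets into (k+1)-candidates and keep
        # only candidates whose k-subsets are all frequent.
        k += 1
        current = set()
        for a, b in combinations(level, 2):
            cand = a | b
            if len(cand) == k and all(
                frozenset(sub) in level for sub in combinations(cand, k - 1)
            ):
                current.add(cand)
    return result


def f_score(retrieved, relevant):
    """Harmonic mean of precision and recall for two sets of record identifiers."""
    tp = len(retrieved & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotation terms of records already in the specialist database.
known = [
    {"snake", "neurotoxin", "venom"},
    {"snake", "neurotoxin", "venom", "precursor"},
    {"snake", "neurotoxin", "toxin"},
    {"snake", "neurotoxin", "venom"},
]

itemsets = frequent_itemsets(known, min_support=0.75)
# Use the largest frequent itemset (ties broken by support) as a conjunctive query.
best = max(itemsets, key=lambda s: (len(s), itemsets[s]))

# Hypothetical new entries in the public source, with a hand-labelled relevant set.
incoming = {
    "A1": {"snake", "neurotoxin", "venom", "precursor"},
    "A2": {"snake", "venom", "metalloproteinase"},
    "A3": {"snake", "neurotoxin", "venom"},
    "A4": {"scorpion", "neurotoxin", "venom"},
}
relevant = {"A1", "A3"}

simple_hits = {acc for acc, terms in incoming.items() if "venom" in terms}
complex_hits = {acc for acc, terms in incoming.items() if best <= terms}

print("complex query terms:", sorted(best))
print("simple query F-score:", round(f_score(simple_hits, relevant), 3))
print("complex query F-score:", round(f_score(complex_hits, relevant), 3))
```

On this toy data the single-term query also retrieves the two unrelated entries, while the conjunctive query retrieves only the two relevant ones. In practice a support threshold that is too strict could instead cost recall, so the threshold would need tuning against labelled data.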


Keywords: Frequent Itemsets; Association Rule Mining; Complex Query; Specialist Database; Original Query
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kai-Tak Lam (1)
  • Judice L. Y. Koh (2, 3)
  • Bharadwaj Veeravalli (1)
  • Vladimir Brusic (4)

  1. Department of Electrical & Computer Engineering, National University of Singapore, Singapore
  2. Institute for Infocomm Research, Singapore
  3. School of Computing, National University of Singapore, Singapore
  4. Australian Centre for Plant Functional Genomics, School of Land and Food Sciences, and the Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
