Various Features with Integrated Strategies for Protein Name Classification

  • Budi Taruna Ongkowijaya
  • Shilin Ding
  • Xiaoyan Zhu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3759)


The classification task is an integral part of a named entity recognition system: it assigns each recognized named entity to its corresponding class. This task has received little attention in the biomedical domain because previous studies did not differentiate feature sources and strategies. In this research, we analyze different sources and strategies for protein name classification and develop integrated strategies that combine the advantages of rule-based, dictionary-based, and statistical methods. In the rule-based method, we use terms and knowledge of protein nomenclature that provide strong cues for protein names. In the dictionary-based method, we use a set of rules for curating a protein name dictionary. These terms and dictionaries are combined with our own features in a statistical classifier. Our features comprise word shape features and unigram and bigram features. With these information sources and integrated strategies, the system achieves state-of-the-art performance in classifying protein and non-protein names.
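
To make the feature types named in the abstract concrete, the following is a minimal sketch of word shape, unigram, and bigram features for a candidate name. The abstract does not define these features, so the shape mapping (uppercase to "A", lowercase to "a", digit to "0", with repeated symbols collapsed) and the helper names `word_shape`, `ngram_features`, and `extract_features` are assumptions chosen for illustration, not the authors' implementation.

```python
import re


def word_shape(token: str) -> str:
    """Map a token to a coarse shape string (assumed scheme, not from the paper)."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append(ch)  # keep punctuation such as '-' unchanged
    # Collapse runs of the same symbol, e.g. "AA-0" -> "A-0".
    return re.sub(r"(.)\1+", r"\1", "".join(shape))


def ngram_features(tokens: list[str], n: int) -> list[str]:
    """Return all contiguous word n-grams of the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(name: str) -> dict[str, int]:
    """Build a sparse binary feature dict for one candidate name."""
    tokens = name.split()
    feats: dict[str, int] = {}
    for tok in tokens:
        feats["shape=" + word_shape(tok)] = 1
    for gram in ngram_features(tokens, 1):
        feats["uni=" + gram.lower()] = 1
    for gram in ngram_features(tokens, 2):
        feats["bi=" + gram.lower()] = 1
    return feats
```

A feature dictionary of this kind could be passed to any statistical classifier that accepts sparse binary features. Which classifier the authors used is not stated in the abstract.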


Classification Task, Integrate Strategy, Regular Expression, External Information, Name Entity Recognition
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Budi Taruna Ongkowijaya (1)
  • Shilin Ding (1)
  • Xiaoyan Zhu (1)
  1. State Key Laboratory of Intelligent Technology and Systems (LITS), Department of Computer Science and Technology, Tsinghua University, Beijing, China
