Empirical Textual Mining to Protein Entities Recognition from PubMed Corpus

  • Tyne Liang
  • Ping-Ke Shih
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


Named Entity Recognition (NER) from biomedical literature is crucial to automating the construction of biomedical knowledge bases. In this paper, both empirical-rule and statistical approaches to protein entity recognition are presented and evaluated on a general corpus, GENIA 3.02p, and a new domain-specific corpus, SRC. Experimental results show that the rules derived from SRC are useful even though they are simpler and more general than those used by other rule-based approaches. In addition, a concise HMM-based model with a rich set of features is presented and shown to be robust and competitive with other successful hybrid models. The resolution of coordination variants, which are common in entity recognition, is also addressed. By applying heuristic rules and a clustering strategy, the presented resolver is shown to be feasible.
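
To make the notion of a coordination variant concrete, the following is a minimal sketch of one possible heuristic expansion rule. It is not the resolver described in the paper: the suffix shapes, the `COORD` pattern, the function name `expand_coordination`, and the example phrases are all assumptions chosen for illustration, and the clustering step mentioned in the abstract is not covered.

```python
import re

# Assumed suffix shapes (hypothetical, not taken from the paper):
# a single capital letter, a number, or a spelled-out Greek letter.
SUFFIX = r"(?:[A-Z]|\d+|alpha|beta|gamma|delta)"

# A shared head phrase followed by suffixes joined by commas and "and"/"or",
# e.g. "protein kinase A and B".
COORD = re.compile(
    rf"^(?P<head>.+?)\s+(?P<first>{SUFFIX})"
    rf"(?P<rest>(?:\s*,\s*{SUFFIX})*)\s*,?\s+(?:and|or)\s+(?P<last>{SUFFIX})$"
)

def expand_coordination(phrase: str) -> list[str]:
    """Expand a coordinated mention into one full name per suffix.

    Phrases that do not match the assumed pattern are returned unchanged.
    """
    text = phrase.strip()
    match = COORD.match(text)
    if match is None:
        return [text]
    middle = re.findall(SUFFIX, match.group("rest"))
    suffixes = [match.group("first"), *middle, match.group("last")]
    return [f"{match.group('head')} {suffix}" for suffix in suffixes]

if __name__ == "__main__":
    for mention in ("protein kinase A and B", "interleukin 2, 4 and 6", "NF-kappa B"):
        print(mention, "->", expand_coordination(mention))
```

A single regular expression like this only handles the simplest shared-head case; coordinated modifiers and nested coordination would need further rules.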


Hidden Markov Model, Regular Expression, Named Entity Recognition, Entity Recognition, Coordination Variant
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Tyne Liang (1)
  • Ping-Ke Shih (1)
  1. Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan
