Abstract
Because the volume of biomedical literature is growing rapidly, automatic processing of biomedical papers is increasingly important. Named Entity Recognition (NER) in this type of writing poses several difficulties. In this paper we present a system that finds phenotype names in biomedical literature. The system is based on MetaMap and makes use of the UMLS Metathesaurus and the Human Phenotype Ontology. Starting from a basic system that uses only these preexisting tools, we propose five rules that capture stylistic and linguistic properties of this type of literature to enhance the performance of our NER tool. The tool is tested on a small corpus, achieving a precision of 97.6% and a recall of 88.3%.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Khordad, M., Mercer, R.E., Rogan, P. (2011). Improving Phenotype Name Recognition. In: Butz, C., Lingras, P. (eds) Advances in Artificial Intelligence. Canadian AI 2011. Lecture Notes in Computer Science, vol 6657. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21043-3_30
DOI: https://doi.org/10.1007/978-3-642-21043-3_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21042-6
Online ISBN: 978-3-642-21043-3
eBook Packages: Computer Science, Computer Science (R0)