Abstract
Automated extraction of information from the biological literature promises to play an increasingly important role in text-based knowledge discovery. This is particularly true for high-throughput approaches such as microarrays, and for systems biology approaches that combine data from different sources. We have developed an integrated system that combines protein/gene name dictionaries, synonymy dictionaries, natural language processing, and pattern-matching rules to extract and organize gene relationships from full-text articles. In the first phase, full-text articles were collected from 20 peer-reviewed journals in molecular biology and biomedicine, covering the five years from 1999 to 2003. The extracted relationships were organized in a database that records the unique PubMed ID and a section ID (abstract, introduction, materials and methods, or results and discussion) to identify the source article and section from which each concept was extracted. This paper presents the system architecture, its distinguishing features, and its advantages. It is hoped that the resulting knowledge base will assist in the understanding of gene lists generated from microarray experiments.
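As a rough illustration of the dictionary-plus-pattern idea summarized in the abstract, the Python sketch below normalizes gene names through a synonym dictionary, applies a single pattern-matching rule, and tags each extracted relationship with a PubMed ID and section ID. It is not the authors' implementation: the dictionary entries, the verb list, the `Relation` record, the function names, and the placeholder gene names and PubMed ID are all assumptions made for this example, and the natural language processing component of the real system is not modeled.

```python
import re
from dataclasses import dataclass

# Hypothetical synonym dictionary: lowercase surface form -> canonical symbol.
# The gene names are placeholders, not real database entries.
SYNONYMS = {
    "genea": "GENEA",
    "gene alpha": "GENEA",
    "geneb": "GENEB",
    "gene beta": "GENEB",
}

# Hypothetical relationship verbs for one simple "X <verb> Y" rule.
VERBS = ("activates", "inhibits", "regulates", "binds")


@dataclass(frozen=True)
class Relation:
    pmid: str      # PubMed ID of the source article
    section: str   # e.g. "abstract", "introduction"
    subject: str   # canonical symbol of the first gene
    verb: str      # relationship verb, lowercased
    obj: str       # canonical symbol of the second gene


def build_pattern(synonyms, verbs):
    # Try longer names first so multiword names are preferred.
    names = sorted(synonyms, key=len, reverse=True)
    name_alt = "|".join(re.escape(n) for n in names)
    verb_alt = "|".join(re.escape(v) for v in verbs)
    return re.compile(
        rf"\b(?P<subj>{name_alt})\b\s+(?P<verb>{verb_alt})\s+"
        rf"\b(?P<obj>{name_alt})\b",
        re.IGNORECASE,
    )


PATTERN = build_pattern(SYNONYMS, VERBS)


def extract_relations(pmid, section, text):
    """Return every 'gene verb gene' match in text as a Relation."""
    relations = []
    for m in PATTERN.finditer(text):
        relations.append(
            Relation(
                pmid=pmid,
                section=section,
                subject=SYNONYMS[m.group("subj").lower()],
                verb=m.group("verb").lower(),
                obj=SYNONYMS[m.group("obj").lower()],
            )
        )
    return relations


# Made-up sentence and placeholder PubMed ID, for illustration only.
example = extract_relations(
    "00000000", "abstract", "Gene alpha regulates GeneB in this toy sentence."
)
```

Keeping the PubMed ID and section on every record means each relationship can be traced back to the article and section it came from, which is the role the abstract assigns to those two fields in the database.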
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bremer, E.G., Natarajan, J., Zhang, Y., DeSesa, C., Hack, C.J., Dubitzky, W. (2004). Text Mining of Full Text Articles and Creation of a Knowledge Base for Analysis of Microarray Data. In: López, J.A., Benfenati, E., Dubitzky, W. (eds) Knowledge Exploration in Life Science Informatics. KELSI 2004. Lecture Notes in Computer Science, vol 3303. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30478-4_8
DOI: https://doi.org/10.1007/978-3-540-30478-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23927-7
Online ISBN: 978-3-540-30478-4
eBook Packages: Springer Book Archive