New Challenges for Biological Text-Mining in the Next Decade

Dai, Hong-Jie; Chang, Yen-Ching; Tzong-Han Tsai, Richard; Hsu, Wen-Lian

doi:10.1007/s11390-010-9313-5

Regular Paper
Published: 20 January 2010

Volume 25, pages 169–179, (2010)

Journal of Computer Science and Technology

Hong-Jie Dai^1,2,
Yen-Ching Chang^1,
Richard Tzong-Han Tsai^3 &
…
Wen-Lian Hsu^1,2

Abstract

The massive migration of scholarly publications from traditional paper journals to online outlets has benefited biologists by making the literature easier to access. However, due to the sheer volume of available biological literature, researchers are finding it increasingly difficult to locate the information they need. As a result, recent biology contests, notably JNLPBA and BioCreAtIvE, have focused on evaluating various methods by which the literature may be navigated. Among these methods, text-mining technology has shown the most promise. With recent advances in text-mining technology and the fact that publishers are now making the full texts of articles available in XML format, text-mining systems (TMSs) can be adapted to accelerate literature curation, maintain the integrity of information, and ensure proper linkage of data to other resources. Even so, several new challenges have emerged in relation to full-text analysis, life-science terminology, complex relation extraction, and information fusion. These challenges must be overcome for text-mining to become more effective. In this paper, we identify the challenges, discuss how they might be overcome, and consider the resources that may be helpful in achieving that goal.

Author information

Authors and Affiliations

Institute of Information Science, “Academia Sinica”, 115, Taiwan, China
Hong-Jie Dai, Yen-Ching Chang & Wen-Lian Hsu (Fellow, IEEE)
Department of Computer Science, “National Tsing-Hua University”, 300, Taiwan, China
Hong-Jie Dai & Wen-Lian Hsu (Fellow, IEEE)
Department of Computer Science and Engineering, Yuan Ze University, 320, Taiwan, China
Richard Tzong-Han Tsai

Corresponding author

Correspondence to Hong-Jie Dai.

Additional information

This work was supported by the “National Science Council” under Grant Nos. NSC 97-2218-E-155-001 and NSC96-2752-E-001-001-PAE, the Research Center for Humanities and Social Sciences, and the Thematic Program of “Academia Sinica” under Grant No. AS95ASIA02.

About this article

Cite this article

Dai, HJ., Chang, YC., Tzong-Han Tsai, R. et al. New Challenges for Biological Text-Mining in the Next Decade. J. Comput. Sci. Technol. 25, 169–179 (2010). https://doi.org/10.1007/s11390-010-9313-5

Received: 01 September 2009
Revised: 24 November 2009
Published: 20 January 2010
Issue Date: January 2010
DOI: https://doi.org/10.1007/s11390-010-9313-5
