Text Mining in Medicine

Computational Medicine in Data Mining and Modeling

Abstract

Many medical applications and much ongoing medical research depend on text mining techniques. An estimated 90 % of all data is unstructured: emails, voice and video recordings, data streams, and Word documents. Over the last decade, unstructured data is estimated to have grown by about 62 %, whereas structured data has grown by only 22 %. In this chapter we therefore give an overview of methods and tools that enable researchers to automatically retrieve, extract, and integrate unstructured medical data. As the number of unstructured documents increases, automatic text mining methods ease access to relevant data and to previously conducted research and its results, and they save money by helping to avoid repeated experiments. Natural language processing has recently received a lot of attention as researchers try to adapt techniques from other domains to biomedical data. We focus especially on methods from the fields of (1) information retrieval (indexing, searching, and retrieval of relevant documents given an input query), (2) information extraction (automatic extraction of structured data from unstructured sources, with the main tasks of named entity recognition, relationship extraction, and coreference resolution), and (3) data integration (data merging and redundancy elimination).
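To make the information retrieval task in item (1) more concrete, the following minimal Python sketch builds an inverted index over a few documents and answers a conjunctive keyword query. It is an illustration only, not the method described in the chapter: the toy documents and the function names `tokenize`, `build_index`, and `search` are assumptions introduced here, and a real system would add ranking, stemming, and domain vocabularies.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Searching: return sorted ids of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # .get avoids creating empty entries for unseen tokens.
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))


# Toy documents, invented for illustration only (hypothetical data).
DOCS = {
    1: "Aspirin reduces the risk of myocardial infarction.",
    2: "Ibuprofen and aspirin are common analgesics.",
    3: "Myocardial infarction risk rises with hypertension.",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search(INDEX, "aspirin risk"))
    print(search(INDEX, "myocardial infarction"))
```

The index is built once and each query only intersects the posting sets of its tokens, so retrieval does not rescan the document texts.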

Author information

Corresponding author

Correspondence to Slavko Žitnik.

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Žitnik, S., Bajec, M. (2013). Text Mining in Medicine. In: Rakocevic, G., Djukic, T., Filipovic, N., Milutinović, V. (eds) Computational Medicine in Data Mining and Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8785-2_4

  • DOI: https://doi.org/10.1007/978-1-4614-8785-2_4

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-8784-5

  • Online ISBN: 978-1-4614-8785-2

  • eBook Packages: Computer Science, Computer Science (R0)
