Characteristics of Patient Records and Clinical Corpora

  • Hercules Dalianis
Open Access


This chapter describes the linguistic characteristics of patient record text, including spelling errors, domain-specific abbreviations, and negation and assertion expressions, for English, Swedish and other languages.



Copyright information

© The Author(s) 2018

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Hercules Dalianis, DSV-Stockholm University, Kista, Sweden
