“Similar query was answered earlier”: processing of patient authored text for retrieving relevant contents from health discussion forum

  • Sujan Kumar Saha
  • Amit Prakash
  • Mukta Majumder
Part of the following topical collections:
  1. Special Issue on Application of Artificial Intelligence in Health Research


Online remedy finders and health-related discussion forums have become increasingly popular in recent years. Ordinary web users describe their health problems there and request suggestions from experts or other users. As a result, these forums have become a huge repository of information and discussions on various health issues. An intelligent information retrieval system can help to exploit this repository in various applications. In this paper, we propose a system that, given a new post, automatically identifies existing similar forum posts. The system is based on computing the similarity between two patient authored texts. To compute the similarity between the current post and existing posts, the system uses a hybrid strategy based on template information, topic modelling, and latent semantic indexing. The system is tested using a set of real questions collected from a homeopathy forum. The relevance of the posts retrieved by the system is evaluated by human experts. The evaluation results show that the precision of the system is 88.87%.
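The retrieval step described above, ranking existing posts by their similarity to a new post, can be illustrated with a minimal sketch. This is not the hybrid method proposed in the paper: it substitutes plain TF-IDF weighting with cosine similarity for the template, topic modelling, and latent semantic indexing components, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_similar`) and the example posts are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_similar(new_post, existing_posts):
    """Return (score, index) pairs for existing posts, most similar first."""
    vectors = tfidf_vectors(existing_posts + [new_post])
    query = vectors[-1]
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors[:-1])]
    return sorted(scored, reverse=True)


# Hypothetical forum posts, used only to exercise the sketch.
existing = [
    "I have a severe headache and nausea every morning",
    "My child has a skin rash and itching on both arms",
    "Knee joint pain while climbing stairs",
]
new = "Headache with nausea after waking up in the morning"
ranking = rank_similar(new, existing)
```

Here the first entry of `ranking` points to the headache post, since it is the only one sharing terms with the new post. A system along the lines of the abstract would replace the raw TF-IDF vectors with richer representations, but the ranking loop would keep the same shape.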


Health information retrieval; Patient authored text; Web forum analysis; Natural language processing; Public health informatics



The authors declare that they have received no funding for the current study.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, Birla Institute of Technology Mesra, Ranchi, India
  2. Department of Computer Science and Application, University of North Bengal, West Bengal, India
