
Constructing a knowledge-based heterogeneous information graph for medical health status classification



Building a heterogeneous information graph (HIG) from Pearson correlation and semantic relations, and using it to develop a classification model, notably improves the accuracy of predicting health risk status. The approach used in this study integrates medical domain knowledge with Pearson correlation and semantic relations to build a classification model for diagnosis. Knowledge was mined from the titles and abstracts of MEDLINE to assess the links between objects relating to medical concepts. A knowledge-based HIG model was then developed to predict a patient’s health status. The experimental results showed that the knowledge-based model outperformed the baseline model, demonstrating that the knowledge base can improve the performance of the classification model. This study contributes a framework for applying a knowledge base in classification models to improve their prediction performance. It also contributes a model to medical practice that may help practitioners become more confident in making final diagnostic decisions. Moreover, the study affirms that biomedical literature can assist in building a classification model, which will benefit future researchers who mine knowledge bases to develop other kinds of classification models.
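As a rough illustration of how Pearson correlation can supply weighted links for a graph, the sketch below connects two attributes whenever the absolute value of their correlation reaches a threshold. It is not the authors' implementation: the function names `pearson` and `correlation_edges`, the threshold value, and the toy records are all assumptions made for this example, and the semantic relations mined from MEDLINE are not modelled here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treating it as 0.0
        # is a choice made for this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(columns, threshold=0.5):
    """Return {(name_a, name_b): r} for attribute pairs with |r| >= threshold."""
    edges = {}
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Made-up values for illustration only; not taken from NHANES or NAMCS.
records = {
    "bmi": [22.0, 27.5, 31.0, 24.0, 35.0],
    "glucose": [85.0, 100.0, 120.0, 90.0, 140.0],
    "height": [170.0, 165.0, 180.0, 175.0, 160.0],
}
edges = correlation_edges(records, threshold=0.8)
```

With these toy values only the `bmi` and `glucose` attributes are linked; in a heterogeneous graph such correlation edges would sit alongside edges of other types, such as semantic relations between medical concepts.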









The work was conducted with approval from the Human Research Ethics Committee of the University of Southern Queensland, Australia (Approval ID: H18REA049). The authors acknowledge the use of the National Health and Nutrition Examination Survey (NHANES) and the National Ambulatory Medical Care Survey (NAMCS) in the study, and especially thank the Centers for Disease Control and Prevention of the U.S. Department of Health and Human Services for making the data sets publicly available for research purposes. The authors also appreciate the courtesy of the U.S. National Library of Medicine in allowing the use of MEDLINE.

Author information

Correspondence to Thuan Pham.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Pham, T., Tao, X., Zhang, J. et al. Constructing a knowledge-based heterogeneous information graph for medical health status classification. Health Inf Sci Syst 8, 10 (2020). https://doi.org/10.1007/s13755-020-0100-6



  • Knowledge graph
  • Electronic health data
  • Classification
  • Healthcare