Constructing a knowledge-based heterogeneous information graph for medical health status classification


Building a heterogeneous information graph (HIG) from Pearson correlation and semantic relations has notably improved the accuracy of classification models that predict health risk status. This study integrated medical domain knowledge with Pearson correlation and semantic relations to build a classification model for diagnosis. Knowledge mined from MEDLINE titles and abstracts was used to assess the links between objects relating to medical concepts. A knowledge-based HIG model was then developed to predict a patient’s health status. The experimental results showed that the knowledge-based model outperformed the baseline model, demonstrating that a knowledge base can improve the performance of a classification model. The study contributes a framework for applying a knowledge base in classification models to improve their predictive performance. It also contributes a model for medical practice that can help practitioners be more confident when making final diagnostic decisions. Moreover, the study affirms that biomedical literature can assist in building a classification model, which will benefit future researchers who mine knowledge bases to develop other kinds of classification models.
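As a rough illustration of the correlation step mentioned above, the sketch below links pairs of attributes whose Pearson correlation is strong enough. It is only a minimal sketch under stated assumptions: the abstract does not give the authors' actual procedure, so the function names (`pearson`, `correlation_edges`), the 0.5 threshold, and the toy records are all hypothetical, and the semantic relations mined from MEDLINE are not modelled here.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_edges(columns, threshold=0.5):
    """Return {(a, b): r} for attribute pairs whose |r| meets the threshold.

    The threshold value is an illustrative assumption, not a value
    reported in the article.
    """
    edges = {}
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Hypothetical toy records; not taken from NHANES or NAMCS.
records = {
    "glucose": [90, 110, 130, 150, 170],
    "bmi": [21, 24, 27, 30, 33],
    "sleep_hours": [7, 8, 6, 8, 7],
}
edges = correlation_edges(records, threshold=0.5)
```

In this toy data only the `bmi` and `glucose` columns move together, so they are the only pair that receives an edge. In a fuller graph, such correlation-weighted edges could sit alongside edges derived from literature-based semantic relations.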








The work was conducted with approval from the Human Research Ethics Committee of the University of Southern Queensland, Australia (Approval ID: H18REA049). The authors acknowledge the use of the National Health and Nutrition Examination Survey (NHANES) and the National Ambulatory Medical Care Survey (NAMCS) in the study and especially thank the Centers for Disease Control and Prevention of the United States Department of Health and Human Services for making the data sets publicly available for research purposes. The authors also appreciate the courtesy of the U.S. National Library of Medicine in allowing the use of MEDLINE.

Author information



Corresponding author

Correspondence to Thuan Pham.



About this article


Cite this article

Pham, T., Tao, X., Zhang, J. et al. Constructing a knowledge-based heterogeneous information graph for medical health status classification. Health Inf Sci Syst 8, 10 (2020).



  • Knowledge graph
  • Electronic health data
  • Classification
  • Healthcare