Abstract
Given the continuous growth of large-scale, complex electronic healthcare data, data-driven healthcare cohort discovery that combines machine learning tools with domain expert knowledge is required to gain further insights into the healthcare system. Specifically, clustering plays a crucial role in healthcare cohort discovery, and metric learning can incorporate expert feedback to generate more fit-for-purpose clustering outputs. However, most existing metric learning methods assume that all labelled instances already pre-exist, which is not always true in real-world applications. In addition, big data in healthcare brings new challenges for metric learning in handling complex structured data. In this paper, we propose a novel systematic method, namely Interactive Deep Metric Learning (IDML), which uses an interactive process to iteratively incorporate feedback from domain experts to identify cohorts that are more relevant to a particular pre-defined purpose. Moreover, the proposed method leverages powerful deep learning-based embedding techniques to incrementally obtain effective representations for the complex structures inherent in patient journey data. We experimentally evaluate the effectiveness of the proposed IDML on two public healthcare datasets. The proposed method has also been implemented in an interactive cohort discovery tool for a real-world application in healthcare.
Appendices
A LSTM Formulation
$$\begin{aligned} i_t&= \sigma \left( W_i v_t + U_i h_{t-1} + b_i\right) \\ f_t&= \sigma \left( W_f v_t + U_f h_{t-1} + b_f\right) \\ o_t&= \sigma \left( W_o v_t + U_o h_{t-1} + b_o\right) \\ \tilde{c}_t&= \tanh \left( W_c v_t + U_c h_{t-1} + b_c\right) \\ c_t&= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ h_t&= o_t \odot \tanh \left( c_t\right) \end{aligned}$$
where \(i_t, f_t, o_t\) are the input, forget and output gates respectively. The gates are separate neural networks that decide which information is allowed onto the cell state; during training they learn which information is relevant to keep or forget. \(v_t\) is the input to the network at time t, \(h_{t-1}\) is the output at time \(t-1\), and \(c_{t-1}\) is the internal cell state at \(t-1\).
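The LSTM cell described above can be sketched in NumPy. This is an illustrative implementation of the standard LSTM recurrence (Hochreiter and Schmidhuber), not the authors' code; the function name `lstm_step` and the per-gate parameter dictionaries `W`, `U`, `b` are hypothetical conveniences.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b are dicts keyed by gate name: 'i' (input), 'f' (forget),
    'o' (output), 'c' (candidate cell state).
    """
    i = sigmoid(W["i"] @ v_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ v_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ v_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["c"] @ v_t + U["c"] @ h_prev + b["c"])   # candidate cell state
    c = f * c_prev + i * g        # new cell state: keep part of old, add new
    h = o * np.tanh(c)            # new hidden state, gated by the output gate
    return h, c
```

In a full sequence model, `lstm_step` would be applied over each visit in a patient journey, with `h` carried forward as the journey representation.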
B Clustering Evaluation Measures
- The Normalized Mutual Information (NMI) is defined as:
  $$\begin{aligned} NMI (\widehat{K};K) = \dfrac{2\times I(\widehat{K};K)}{\left[ H( \widehat{K}) +H(K) \right] } \end{aligned}$$ (7)
  where \(I(\widehat{K};K)\) is the mutual information between the clustering \(\widehat{K}\) and the ground-truth cohorts K, and the entropies \(H(\widehat{K})\) and H(K) normalize the mutual information to the range [0, 1]. The higher the NMI, the better the clustering.
- The Adjusted Rand Index (ARI) of a clustering is defined as:
  $$\begin{aligned} ARI = \dfrac{RI - E[RI]}{\max (RI) - E[RI]} \end{aligned}$$ (8)
  where \(RI = \dfrac{a+b}{C^{N}_{2}}\), a is the number of patient pairs that come from the same cohort and are grouped into the same cluster, b is the number of patient pairs that belong to different cohorts and are grouped into different clusters, and N is the total number of patients. ARI takes values in [−1, 1]. The higher the ARI, the better the clustering.
- The Purity of a clustering is defined as:
  $$\begin{aligned} Purity (\widehat{K},K) = \dfrac{1}{N}\sum ^{|\widehat{K}|}_{i=1}\max _{j}\left| \widehat{k_{i}}\cap k_{j}\right| , \end{aligned}$$ (9)
  where \(\widehat{K}=\{\widehat{k_{1}}, \widehat{k_{2}},\dots , \widehat{k_{|\widehat{K}|}}\}\) is the set of clusters produced by the chosen clustering algorithm, \(|\widehat{K}|\) is the total number of clusters, and \(K=\{k_{1}, k_{2},\dots , k_{|K|}\}\) is the set of ground-truth patient cohorts. \(\max _{j}\left| \widehat{k_{i}}\cap k_{j}\right| \) is the size of the intersection between cluster \(\widehat{k_{i}}\) and the patient cohort it overlaps most. Purity takes values in [0, 1]. The higher the Purity, the better the clustering.
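The three measures above can be computed with scikit-learn plus a small purity helper. This is a sketch, not the paper's evaluation code: it assumes scikit-learn's `normalized_mutual_info_score` (whose default arithmetic averaging matches the NMI definition above) and `adjusted_rand_score`, and the toy label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Purity: for each predicted cluster, count its most frequent
    ground-truth cohort; sum over clusters and divide by N."""
    m = contingency_matrix(labels_true, labels_pred)  # cohorts x clusters
    return m.max(axis=0).sum() / m.sum()

# Hypothetical toy labels: 6 patients, 3 ground-truth cohorts.
truth = [0, 0, 1, 1, 2, 2]
pred  = [0, 0, 1, 2, 2, 2]

nmi = normalized_mutual_info_score(truth, pred)  # in [0, 1]
ari = adjusted_rand_score(truth, pred)           # in [-1, 1]
pur = purity(truth, pred)                        # 5/6 for these labels
```

A perfect clustering yields NMI = ARI = Purity = 1; note that Purity alone can be gamed by assigning every patient to its own cluster, which is why the three measures are reported together.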
© 2019 Springer Nature Singapore Pte Ltd.
Wang, Y., Long, G., Peng, X., Clarke, A., Stevenson, R., Gerrard, L. (2019). Interactive Deep Metric Learning for Healthcare Cohort Discovery. In: Le, T., et al. Data Mining. AusDM 2019. Communications in Computer and Information Science, vol 1127. Springer, Singapore. https://doi.org/10.1007/978-981-15-1699-3_17