MNAR Imputation with Distributed Healthcare Data

Pereira, Ricardo Cardoso; Santos, Miriam Seoane; Rodrigues, Pedro Pereira; Abreu, Pedro Henriques

doi:10.1007/978-3-030-30244-3_16

Ricardo Cardoso Pereira, Miriam Seoane Santos, Pedro Pereira Rodrigues & Pedro Henriques Abreu

Conference paper
First Online: 30 August 2019

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11805)

Abstract

Missing data is a common problem in real-world datasets and has a considerable impact on the learning process of classifiers. Although extensive work has been done in this field, the missing not at random (MNAR) mechanism remains a challenge for existing imputation methods, mainly because the missingness is not related to any observed information. In healthcare contexts, MNAR arises in multiple scenarios, such as clinical trials where participants may quit the study for reasons related to the outcome being measured. This work proposes an approach that uses different sources of information from the same healthcare context to improve imputation quality and classification performance for datasets with missing data under MNAR. The experiment was performed with several databases from the medical domain, and the results show that using multiple sources of data has a positive impact on the imputation error and classification performance.
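To make the MNAR mechanism concrete, the following minimal sketch simulates the clinical-trial scenario described above. It is an illustration only and is not the approach proposed in the paper. The helper names (simulate_mnar, mean_impute), the threshold, the drop probability, and the synthetic Gaussian "outcome" are all assumptions chosen for this example. Values are hidden with a probability that depends on the value itself, so a naive mean imputation computed from the observed values is biased.

```python
import random
import statistics


def simulate_mnar(values, threshold, drop_prob, seed=0):
    """Hide values above `threshold` with probability `drop_prob`.

    The chance of a value being missing depends on the value itself,
    which is what characterises MNAR. (Hypothetical helper for this sketch.)
    """
    rng = random.Random(seed)
    return [
        None if (v > threshold and rng.random() < drop_prob) else v
        for v in values
    ]


def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Synthetic "outcome" measurements; the distribution is an assumption.
data_rng = random.Random(42)
complete = [data_rng.gauss(50.0, 10.0) for _ in range(5000)]

# Participants with high outcome values are more likely to drop out.
with_missing = simulate_mnar(complete, threshold=55.0, drop_prob=0.8)
imputed = mean_impute(with_missing)

true_mean = statistics.fmean(complete)
imputed_mean = statistics.fmean(imputed)
print(f"true mean: {true_mean:.2f}, mean after imputation: {imputed_mean:.2f}")
```

Because only high values are removed, the mean after imputation falls below the true mean. The observed data alone give no indication of this, which is why additional sources of information can be valuable under MNAR.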

Notes

1. Available at https://pandas.pydata.org/.
2. Available at https://scikit-learn.org/stable/.
3. Available at https://github.com/iskandr/fancyimpute.
4. Available at https://archive.ics.uci.edu/ml/datasets.html.
5. Available at https://www.kaggle.com/.
6. Although the Nemenyi test p-values are two-tailed, it is possible to ensure that these results always reflect improvement in the F1 scores by cross-analyzing them with the ones from Table 4.

Author information

Authors and Affiliations

Department of Informatics Engineering, Centre for Informatics and Systems of the University of Coimbra, 3030-290, Coimbra, Portugal
Ricardo Cardoso Pereira, Miriam Seoane Santos & Pedro Henriques Abreu
The IPO-Porto Research Centre, 4200-072, Porto, Portugal
Miriam Seoane Santos
Faculty of Medicine of the University of Porto, Center for Health Technology and Services Research, 4200-319, Porto, Portugal
Pedro Pereira Rodrigues

Corresponding author

Correspondence to Ricardo Cardoso Pereira.

Editor information

Editors and Affiliations

INESC-TEC, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
Paulo Moura Oliveira
University of Minho, Braga, Portugal
Paulo Novais
LIACC/UP, University of Porto, Porto, Portugal
Luís Paulo Reis

Cite this paper

Pereira, R.C., Santos, M.S., Rodrigues, P.P., Abreu, P.H. (2019). MNAR Imputation with Distributed Healthcare Data. In: Moura Oliveira, P., Novais, P., Reis, L. (eds) Progress in Artificial Intelligence. EPIA 2019. Lecture Notes in Computer Science, vol 11805. Springer, Cham. https://doi.org/10.1007/978-3-030-30244-3_16

DOI: https://doi.org/10.1007/978-3-030-30244-3_16
Published: 30 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30243-6
Online ISBN: 978-3-030-30244-3
eBook Packages: Computer Science, Computer Science (R0)
