Abstract
Measuring the completeness of a data population often requires either expert knowledge or the presence of reference data. If neither is available, measuring population completeness becomes nontrivial. We present the ForCE approach (Forecasting for Completeness Estimation), a method to estimate the completeness of timestamped data using time series forecasting. We evaluate the method's feasibility using a real-world dataset from the medical domain, which we provide for download. We compare the method to three baselines, and ForCE outperforms all three.
The original version of this chapter was revised: The authors corrected errors in the figures appearing in Sect. 3.2 and the Appendix and adjusted the text referring to the figures. An erratum to this chapter can be found at DOI: 10.1007/978-3-319-23135-8_32
Notes
1. Enterprise resource planning system.
2. A “benefit” is any creditable treatment, counseling, or similar action a practitioner performs.
3. Snippet of cleaned real-world data from a medical center. To exemplify our proposition, an artificial error has been introduced at data point 25.
4. Available for download at www6.cs.fau.de/files/completeness_data.zip.
5. See r-project.org.
6. If the classifier always guesses positive, all actual positives are caught.
Acknowledgements
Parts of this work are supported by the German Federal Ministry of Education and Research (BMBF), grant No. 13EX1013D.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Endler, G., Baumgärtel, P., Wahl, A.M., Lenz, R. (2015). ForCE: Is Estimation of Data Completeness Through Time Series Forecasts Feasible? In: Tadeusz, M., Valduriez, P., Bellatreche, L. (eds) Advances in Databases and Information Systems. ADBIS 2015. Lecture Notes in Computer Science, vol 9282. Springer, Cham. https://doi.org/10.1007/978-3-319-23135-8_18
DOI: https://doi.org/10.1007/978-3-319-23135-8_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23134-1
Online ISBN: 978-3-319-23135-8
eBook Packages: Computer Science (R0)