Cleaning Missing Data Based on the Bayesian Network

Duan, Liang; Yue, Kun; Qian, Wenhua; Liu, Weiyi

doi:10.1007/978-3-642-39527-7_34

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7901)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

To guarantee data quality, it is necessary to clean the missing data that are prevalent in real-world databases. Many existing data cleaning methods derive the correct value for each missing data item by incorporating additional information, such as functional dependencies or integrity constraints. In this paper, we propose a method for cleaning missing data items without additional information, adopting the Bayesian network (BN) as the framework for representing and inferring probability distributions. First, we learn a Bayesian network, called IBN, from the complete part of the given incomplete database. Then, we infer the probability distribution of each missing data item by Gibbs sampling over the IBN. Consequently, we obtain all possible values with their corresponding probabilities (i.e., confidence degrees), by which we clean the incomplete database. Experimental results show the efficiency, accuracy and precision of our method.
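The two steps in the abstract (fit a Bayesian network on the complete rows, then run Gibbs sampling to obtain a distribution for each missing item) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It makes several assumptions that do not come from the paper: the network structure is fixed by hand as a hypothetical chain A -> B -> C instead of being learned, all variables are binary, parameters are estimated with Laplace smoothing, and the names `PARENTS`, `DOMAIN`, `learn_cpts`, and `gibbs_impute` are invented for illustration.

```python
import random
from collections import Counter, defaultdict

# Assumed toy structure (the paper learns the structure; here it is fixed).
# Each variable maps to its list of parents: A -> B -> C.
PARENTS = {"A": [], "B": ["A"], "C": ["B"]}
DOMAIN = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}


def learn_cpts(rows, alpha=1.0):
    """Estimate P(var | parents) from complete rows with Laplace smoothing."""
    counts = defaultdict(Counter)
    for row in rows:
        for var, pars in PARENTS.items():
            counts[(var, tuple(row[p] for p in pars))][row[var]] += 1

    def prob(var, value, assignment):
        key = (var, tuple(assignment[p] for p in PARENTS[var]))
        total = sum(counts[key].values()) + alpha * len(DOMAIN[var])
        return (counts[key][value] + alpha) / total

    return prob


def gibbs_impute(row, prob, n_samples=2000, burn_in=200, seed=0):
    """Return an estimated distribution for every missing (None) item in row."""
    rng = random.Random(seed)
    missing = [v for v in PARENTS if row[v] is None]
    state = dict(row)
    for v in missing:
        state[v] = rng.choice(DOMAIN[v])  # arbitrary starting point
    tallies = {v: Counter() for v in missing}

    for step in range(burn_in + n_samples):
        for v in missing:
            # Weight of each candidate value given the Markov blanket:
            # P(v | parents) times P(child | its parents) for each child of v.
            weights = []
            for value in DOMAIN[v]:
                state[v] = value
                w = prob(v, value, state)
                for child, pars in PARENTS.items():
                    if v in pars:
                        w *= prob(child, state[child], state)
                weights.append(w)
            state[v] = rng.choices(DOMAIN[v], weights=weights)[0]
        if step >= burn_in:
            for v in missing:
                tallies[v][state[v]] += 1

    return {v: {x: tallies[v][x] / n_samples for x in DOMAIN[v]} for v in missing}


# Complete part of a made-up database in which A, B, and C always agree.
complete = [{"A": a, "B": a, "C": a} for a in (0, 1) for _ in range(20)]
prob = learn_cpts(complete)

# Incomplete row: B is missing, its neighbours are observed.
dist = gibbs_impute({"A": 1, "B": None, "C": 1}, prob)
print(dist)  # one probability per candidate value of B
```

The returned dictionary plays the role of the confidence degrees mentioned in the abstract: each candidate value of a missing item is paired with an estimated probability. A cleaning step could then keep the most probable value or store the whole distribution.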

Author information

Authors and Affiliations

Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, 650091, Kunming, China
Liang Duan, Kun Yue, Wenhua Qian & Weiyi Liu

Editor information

Editors and Affiliations

College of Computer Science, Zhejiang University, Hangzhou, China
Yunjun Gao
Seoul National University, Seoul, Korea
Kyuseok Shim
Institute of Software, Chinese Academy of Sciences, South-Fourth-Street 4, Zhong-Guan-Cun, 100190, Beijing, P.R. China
Zhiming Ding
School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China
Peiquan Jin
School of Computer Science and Technology, Hangzhou Dianzi University, 310018, Hangzhou, China
Zujie Ren
Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, 300384, Tianjin, China
Yingyuan Xiao
CityU-USTC Advanced Research Institute, Suzhou, China
An Liu
School of Information Science and Technology, Southwest Jiaotong University, 610031, Chengdu, China
Shaojie Qiao

About this paper

Cite this paper

Duan, L., Yue, K., Qian, W., Liu, W. (2013). Cleaning Missing Data Based on the Bayesian Network. In: Gao, Y., et al. Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7901. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39527-7_34

DOI: https://doi.org/10.1007/978-3-642-39527-7_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39526-0
Online ISBN: 978-3-642-39527-7
eBook Packages: Computer Science, Computer Science (R0)
