Crowdsourcing-Enhanced Missing Values Imputation Based on Bayesian Network

Ye, Chen; Wang, Hongzhi; Li, Jianzhong; Gao, Hong; Cheng, Siyao

doi:10.1007/978-3-319-32025-0_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

With the development of the Internet, data sets continue to grow in size while remaining rough in quality. Various data problems arise during data collection, and incompleteness is among the most serious. Existing methods for missing-value imputation rely mostly on statistics and machine learning. These methods are limited in efficiency and accuracy because of high-dimensional computation and the low quality of the initial data. In this paper, we propose a new method that combines a Bayesian network with crowdsourcing to handle missing values. We use the Bayesian network to infer missing values, which improves efficiency, and use crowdsourcing to obtain additional information where needed, which improves accuracy. Experiments on real datasets show that our method outperforms other imputation methods.
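
The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a single parent-child pair of categorical attributes, a Laplace-smoothed conditional probability table, and a fixed confidence threshold that decides when to defer to the crowd. The names `fit_cpt`, `impute`, `ask_crowd`, and `threshold`, as well as the toy data, are hypothetical.

```python
from collections import Counter, defaultdict

def fit_cpt(rows, parent, child, alpha=1.0):
    """Estimate P(child | parent) from complete rows with Laplace smoothing."""
    values = sorted({r[child] for r in rows if r[child] is not None})
    counts = defaultdict(Counter)
    for r in rows:
        if r[parent] is not None and r[child] is not None:
            counts[r[parent]][r[child]] += 1

    def posterior(parent_value):
        c = counts[parent_value]
        total = sum(c.values()) + alpha * len(values)
        return {v: (c[v] + alpha) / total for v in values}

    return posterior

def impute(rows, parent, child, ask_crowd, threshold=0.7):
    """Fill missing child values; defer to the crowd when inference is unsure."""
    posterior = fit_cpt(rows, parent, child)
    result = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[child] is None:
            dist = posterior(r[parent])
            best, p = max(dist.items(), key=lambda kv: kv[1])
            # Assumed policy: trust the network only above the threshold.
            r[child] = best if p >= threshold else ask_crowd(r, child)
        result.append(r)
    return result

# Toy data (hypothetical): "smoker" is the parent of "cough".
rows = (
    [{"smoker": "yes", "cough": "yes"}] * 4
    + [{"smoker": "no", "cough": "no"}, {"smoker": "no", "cough": "yes"}]
    + [{"smoker": "yes", "cough": None}, {"smoker": "no", "cough": None}]
)

asked = []  # records which questions were sent to the (simulated) crowd

def ask_crowd(row, attribute):
    """Stand-in for a crowdsourcing task; always answers "no" here."""
    asked.append((row["smoker"], attribute))
    return "no"

completed = impute(rows, "smoker", "cough", ask_crowd)
```

In this toy run the network is confident about the first missing value (smoothed probability 5/6 for "yes") and imputes it directly, while the second is a 50/50 split and is routed to the crowd. A real system would replace the single conditional table with inference over a learned network and would budget the crowd questions.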

Acknowledgement

This paper was supported by NGFR 973 grant 2012CB316200, NSFC grants U1509216, 61472099, and 61133002, and National Sci-Tech Support Plan 2015BAH10F01.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Chen Ye, Hongzhi Wang, Jianzhong Li, Hong Gao & Siyao Cheng

Corresponding author

Correspondence to Chen Ye.

Editor information

Editors and Affiliations

Georgia Institute of Technology, Atlanta, Georgia, USA
Shamkant B. Navathe
University of Texas at Dallas, Richardson, Texas, USA
Weili Wu
University of Minnesota, Minneapolis, Minnesota, USA
Shashi Shekhar
Renmin University, Beijing, China
Xiaoyong Du
Fudan University, Shanghai, China
X. Sean Wang
Rutgers, The State University of New Jer, New Brunswick, New Jersey, USA
Hui Xiong

About this paper

Cite this paper

Ye, C., Wang, H., Li, J., Gao, H., Cheng, S. (2016). Crowdsourcing-Enhanced Missing Values Imputation Based on Bayesian Network. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_5

DOI: https://doi.org/10.1007/978-3-319-32025-0_5
Published: 25 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science, Computer Science (R0)
