Crowdsourcing-Enhanced Missing Values Imputation Based on Bayesian Network

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Abstract

With the development of the Internet, data keeps growing in size and remains rough in quality. During data collection, various data problems arise, among which incompleteness is one of the most serious. Existing methods for missing value imputation rely mostly on statistics and machine learning. These methods are known to be limited in efficiency and accuracy because of high-dimensional computation and low-quality initial data. In this paper, we propose a new method that combines a Bayesian network with crowdsourcing to handle missing values. We use the Bayesian network to infer missing values, which improves efficiency, and use crowdsourcing to obtain the additional information needed to improve accuracy. Experiments on real datasets show that our method achieves better performance than other imputation methods.
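The general idea of pairing probabilistic inference with crowd queries can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes the simplest possible network (one parent attribute and one child attribute), a hypothetical `ask_crowd` callback standing in for a crowdsourcing platform, and an arbitrary confidence `threshold` that decides when the inferred value is trusted. All names and the toy data are illustrative.

```python
from collections import Counter, defaultdict


def learn_cpt(rows, parent, child):
    """Estimate P(child | parent) by counting rows where both are present."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[parent] is not None and row[child] is not None:
            counts[row[parent]][row[child]] += 1
    return {
        p: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for p, ctr in counts.items()
    }


def impute(rows, parent, child, ask_crowd, threshold=0.8):
    """Fill missing child values from the CPT; ask the crowd when unsure.

    ask_crowd is a hypothetical callback (row -> value); threshold is an
    assumed cut-off on the most probable value, not taken from the paper.
    """
    cpt = learn_cpt(rows, parent, child)
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        if row[child] is None:
            dist = cpt.get(row[parent], {})
            best, prob = max(
                dist.items(), key=lambda kv: kv[1], default=(None, 0.0)
            )
            row[child] = best if prob >= threshold else ask_crowd(row)
        filled.append(row)
    return filled


# Toy data: "x" always co-occurs with "p", while "y" is ambiguous.
toy_rows = [
    {"a": "x", "b": "p"},
    {"a": "x", "b": "p"},
    {"a": "y", "b": "p"},
    {"a": "y", "b": "q"},
    {"a": "x", "b": None},
    {"a": "y", "b": None},
]
result = impute(toy_rows, "a", "b", ask_crowd=lambda row: "q")
```

In this sketch the crowd is consulted only for rows where the learned distribution is not decisive, which mirrors the division of labour the abstract describes: inference for efficiency, crowd input for accuracy.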

Acknowledgement

This paper was supported by NGFR 973 grant 2012CB316200, NSFC grants U1509216, 61472099, and 61133002, and National Sci-Tech Support Plan 2015BAH10F01.

Author information

Corresponding author

Correspondence to Chen Ye.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ye, C., Wang, H., Li, J., Gao, H., Cheng, S. (2016). Crowdsourcing-Enhanced Missing Values Imputation Based on Bayesian Network. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_5

  • DOI: https://doi.org/10.1007/978-3-319-32025-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32024-3

  • Online ISBN: 978-3-319-32025-0

  • eBook Packages: Computer Science, Computer Science (R0)
