SmartVote: a full-fledged graph-based model for multi-valued truth discovery

World Wide Web

Abstract

In the era of Big Data, truth discovery has emerged as a fundamental research topic; it estimates data veracity by determining the reliability of multiple, often conflicting, data sources. Although considerable research effort has been devoted to this topic, most current approaches assume only one true value for each object. In reality, objects with multiple true values are common, and the existing approaches that cope with multi-valued objects still lack accuracy. In this paper, we propose a full-fledged graph-based model, SmartVote, which models two types of source relations with additional quantification to precisely estimate source reliability for effective multi-valued truth discovery. Two graphs are constructed and then used to derive different aspects of source reliability (i.e., positive precision and negative precision) via random walk computations. Our model incorporates four important implications to pursue better accuracy in truth discovery: two types of source relations, object popularity, loose mutual exclusion, and the long-tail phenomenon on source coverage. Empirical studies on two large real-world datasets demonstrate the effectiveness of our approach.
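
To give a concrete feel for deriving source scores from a graph via random walk computations, the following is a minimal sketch and not the SmartVote model itself. It assumes a single undirected agreement graph whose edge weights count shared claims, and it uses a generic PageRank-style walk with uniform teleportation. The function names, the `damping` parameter, and the toy sources `s1` to `s4` are all hypothetical choices made for illustration; the paper's two graphs, its quantification of source relations, and its positive and negative precision are not reproduced here.

```python
from itertools import combinations


def build_agreement_graph(claims):
    """Link two sources when they share at least one (object, value) claim.

    claims: dict mapping source -> set of (object, value) pairs.
    Returns dict mapping source -> {neighbor: number of shared claims}.
    """
    graph = {source: {} for source in claims}
    for a, b in combinations(claims, 2):
        shared = len(claims[a] & claims[b])
        if shared:  # sources with no common value get no link
            graph[a][b] = shared
            graph[b][a] = shared
    return graph


def random_walk_scores(graph, damping=0.85, iterations=100, tol=1e-10):
    """Stationary scores of a PageRank-style walk (illustrative assumption)."""
    nodes = list(graph)
    n = len(nodes)
    scores = {s: 1.0 / n for s in nodes}
    for _ in range(iterations):
        # Probability mass sitting on isolated sources is spread uniformly.
        dangling = sum(scores[s] for s in nodes if not graph[s])
        new = {s: (1 - damping) / n + damping * dangling / n for s in nodes}
        for s in nodes:
            total = sum(graph[s].values())
            for t, weight in graph[s].items():
                # Walk from s to t with probability proportional to the weight.
                new[t] += damping * scores[s] * weight / total
        delta = sum(abs(new[s] - scores[s]) for s in nodes)
        scores = new
        if delta < tol:
            break
    return scores


# Toy data (hypothetical): authors claimed for one book by four sources.
claims = {
    "s1": {("book1", "Alice"), ("book1", "Bob")},
    "s2": {("book1", "Alice"), ("book1", "Bob")},
    "s3": {("book1", "Alice")},
    "s4": {("book1", "Carol")},
}
scores = random_walk_scores(build_agreement_graph(claims))
```

In this toy setting, sources that agree with many others accumulate more of the walk's probability mass, while the isolated source `s4` receives only the teleportation share.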

Notes

  1. In this paper we focus on the parent-children relation in the dataset because it corresponds to multi-valued objects.

  2. Note that this probability is conditioned on the prior knowledge that s1 and s2 each provide a true value; it differs from the probability that two sources s1 and s2 independently provide the same true value.

  3. Here we neglect the smoothing links: if two sources share no common value, there is no link between them in the graphs.

  4. We neglect the confidence scores of each source and omit the dependence score normalization step in this example.

  5. https://hama.apache.org/

  6. Such values are then normalized to represent probabilities.

  7. For Voting, we predict the number of true values as the number that receives the most votes.

  8. Note that there are overlaps among those categories. For example, Investment belongs to both Web-link based methods and iterative methods.
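
One possible reading of the Voting baseline in note 7 can be sketched as follows. This is an assumption about how such a baseline might work, not the authors' implementation: each source "votes" for a number of true values through the size of the value set it claims, the most frequently claimed size `k` is taken as the predicted number of truths, and the `k` values with the most votes are returned. The function name `vote_multi_valued` and the tie-breaking rules are hypothetical.

```python
from collections import Counter


def vote_multi_valued(claims):
    """Naive voting for one multi-valued object (illustrative assumption).

    claims: dict mapping source -> set of values claimed for the object.
    Returns the predicted true values as a list, strongest votes first.
    """
    # Each source votes for a truth count via the size of its claimed set.
    size_votes = Counter(len(values) for values in claims.values() if values)
    if not size_votes:
        return []
    # Each source also casts one vote for every value it claims.
    value_votes = Counter(v for values in claims.values() for v in values)
    # Most common set size; ties go to the smaller size (arbitrary choice).
    k = min(size_votes, key=lambda size: (-size_votes[size], size))
    # Rank values by vote count; ties broken alphabetically (arbitrary choice).
    ranked = sorted(value_votes, key=lambda v: (-value_votes[v], v))
    return ranked[:k]


# Toy data (hypothetical): three sources claim two authors, one claims one.
example = {
    "s1": {"Alice", "Bob"},
    "s2": {"Alice", "Bob"},
    "s3": {"Alice"},
    "s4": {"Alice", "Carol"},
}
predicted = vote_multi_valued(example)
```

Here three of the four sources claim two values, so `k` is 2, and the two most-voted values are returned.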

Acknowledgements

Quan Z. Sheng’s research has been partially supported by the Australian Research Council (ARC) Future Fellowship FT140101247 and Discovery Project Grant DP180102378. Dianhui Chu’s research has been partially supported by the National Science Foundation of China (NSFC, No. 61772159). The authors would like to thank the anonymous reviewers for their valuable feedback on this work.

Author information

Corresponding author

Correspondence to Xiu Susie Fang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fang, X.S., Sheng, Q.Z., Wang, X. et al. SmartVote: a full-fledged graph-based model for multi-valued truth discovery. World Wide Web 22, 1855–1885 (2019). https://doi.org/10.1007/s11280-018-0629-3

  • DOI: https://doi.org/10.1007/s11280-018-0629-3
