Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources

  • Lorenzo Blanco
  • Valter Crescenzi
  • Paolo Merialdo
  • Paolo Papotti
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6051)


Several techniques have been developed to extract and integrate data from web sources. However, web data are inherently imprecise and uncertain. This paper addresses the issue of characterizing the uncertainty of data extracted from a number of inaccurate sources. We develop a probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources. Our model considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers. We extend the models previously proposed in the literature by working on several attributes at a time to better leverage all the available evidence. We also report the results of several experiments on both synthetic and real-life data to show the effectiveness of the proposed approach.
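As a rough illustration of how source accuracies can be turned into a probability distribution over conflicting extracted values, the sketch below implements a naive Bayesian vote for a single attribute. It is not the model of this paper: it assumes that sources are independent (no copiers), that each source's accuracy is already known, that the prior over candidate values is uniform, and that a wrong source picks uniformly among the false values. The function name `value_distribution`, the source names, and all numbers are hypothetical.

```python
def value_distribution(claims, accuracy, domain):
    """Posterior over candidate values under a uniform prior.

    claims   -- dict: source -> value the source reports
    accuracy -- dict: source -> assumed probability that the source is right
    domain   -- list of candidate values (assumed to contain the true value
                and every claimed value; needs at least two entries)
    """
    if len(domain) < 2:
        raise ValueError("domain must contain at least two candidate values")
    n_false = len(domain) - 1
    scores = {}
    for candidate in domain:
        # Likelihood of all observed claims if `candidate` were the truth.
        likelihood = 1.0
        for source, claimed in claims.items():
            a = accuracy[source]
            # A correct source reports the truth; a wrong one is assumed to
            # pick uniformly among the n_false incorrect values.
            likelihood *= a if claimed == candidate else (1.0 - a) / n_false
        scores[candidate] = likelihood
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("claims are impossible under the given accuracies")
    # Normalise so the scores form a probability distribution.
    return {value: score / total for value, score in scores.items()}


# Hypothetical example: two weaker sources agree, one stronger source differs.
claims = {"s1": "10.5", "s2": "10.5", "s3": "11.0"}
accuracy = {"s1": 0.6, "s2": 0.6, "s3": 0.9}
dist = value_distribution(claims, accuracy, ["10.5", "11.0", "12.0"])
best = max(dist, key=dist.get)
```

With these assumed accuracies, the single more accurate source outweighs the two agreeing weaker ones, so `best` is `"11.0"`. A plain majority vote would choose `"10.5"` instead, which is why estimating source accuracy matters. Handling copiers and several attributes at once, as the abstract describes, needs more than this sketch provides.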





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Lorenzo Blanco (1)
  • Valter Crescenzi (1)
  • Paolo Merialdo (1)
  • Paolo Papotti (1)

  1. Università degli Studi Roma Tre, Rome, Italy
