Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources

Blanco, Lorenzo; Crescenzi, Valter; Merialdo, Paolo; Papotti, Paolo

doi:10.1007/978-3-642-13094-6_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6051)

Included in the following conference series:

International Conference on Advanced Information Systems Engineering

Abstract

Several techniques have been developed to extract and integrate data from web sources. However, web data are inherently imprecise and uncertain. This paper addresses the issue of characterizing the uncertainty of data extracted from a number of inaccurate sources. We develop a probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources. Our model considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers. We extend the models previously proposed in the literature by working on several attributes at a time to better leverage all the available evidence. We also report the results of several experiments on both synthetic and real-life data to show the effectiveness of the proposed approach.
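
The interplay the abstract describes, where value probabilities and source accuracies are estimated from each other, can be illustrated with a small sketch. This is not the authors' model: it is a generic accuracy-weighted voting loop that handles one attribute at a time, ignores copiers entirely, and assumes that wrong values are equally likely. The function name `reconcile`, the parameters `iterations` and `initial_accuracy`, and the toy data are all hypothetical.

```python
from collections import defaultdict
from math import prod


def reconcile(claims, iterations=20, initial_accuracy=0.8):
    """Alternate between value probabilities and source accuracies.

    claims: dict mapping source -> {object: claimed value}.
    Assumptions (not from the paper): sources are independent, every
    source makes at least one claim, and the candidate values for an
    object are exactly the values some source reports for it.
    """
    accuracy = {s: initial_accuracy for s in claims}

    # Regroup the claims per object: object -> {source: value}.
    votes_by_object = defaultdict(dict)
    for source, observed in claims.items():
        for obj, value in observed.items():
            votes_by_object[obj][source] = value

    probs = {}
    for _ in range(iterations):
        # Step 1: probability of each candidate value, given accuracies.
        for obj, votes in votes_by_object.items():
            candidates = set(votes.values())
            n_wrong = max(len(candidates) - 1, 1)
            likelihood = {
                v: prod(
                    accuracy[s] if claimed == v else (1 - accuracy[s]) / n_wrong
                    for s, claimed in votes.items()
                )
                for v in candidates
            }
            total = sum(likelihood.values())
            probs[obj] = {v: l / total for v, l in likelihood.items()}

        # Step 2: accuracy of each source = mean probability of its claims,
        # clamped away from 0 and 1 so the likelihoods stay positive.
        for source, observed in claims.items():
            mean = sum(probs[o][v] for o, v in observed.items()) / len(observed)
            accuracy[source] = min(max(mean, 0.01), 0.99)

    return probs, accuracy


# Toy data: s1 and s2 agree everywhere, s3 disagrees on "a" and "c".
claims = {
    "s1": {"a": 1, "b": 2, "c": 3},
    "s2": {"a": 1, "b": 2, "c": 3},
    "s3": {"a": 9, "b": 2, "c": 8},
}
probs, accuracy = reconcile(claims)
```

On this toy input the loop favors the values reported by `s1` and `s2` and assigns `s3` a lower accuracy. A scheme of this kind is exactly what copiers can mislead, since copied values inflate the apparent consensus; the abstract states that the paper's model accounts for that case, which this sketch does not.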

Author information

Authors and Affiliations

Università degli Studi Roma Tre, Via della Vasca Navale, 79, Rome, Italy
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo & Paolo Papotti

Editor information

Editors and Affiliations

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milano, Italy
Barbara Pernici

About this paper

Cite this paper

Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P. (2010). Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources. In: Pernici, B. (eds) Advanced Information Systems Engineering. CAiSE 2010. Lecture Notes in Computer Science, vol 6051. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13094-6_8

DOI: https://doi.org/10.1007/978-3-642-13094-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13093-9
Online ISBN: 978-3-642-13094-6
eBook Packages: Computer Science, Computer Science (R0)
