Tuple Reconstruction

Er, Ngurah Agus Sanjaya; Ba, Mouhamadou Lamine; Abdessalem, Talel; Bressan, Stéphane

doi:10.1007/978-3-319-91455-8_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10829)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

The Set of Tuples Expansion system (STEP) extracts information from the Web in the form of tuples. It builds a graph of entities whose nodes are Web pages, wrappers, seeds, domains, and candidates, and whose edges are the relationships between them. The final weight assigned to each node after running random walks on the graph is used to rank the extracted candidates. Owing to the nature of the regular expressions used as wrappers, some of the extracted candidates may contain “noise” and can therefore be considered “false”. These false candidates may rank higher than the “true” ones because they are extracted from many Web pages or produced by many different wrappers. Minimizing the number of false candidates is necessary to ensure the validity of the results presented.
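The ranking effect described above can be illustrated with a minimal sketch. It assumes an undirected toy graph and a PageRank-style random walk with uniform restarts; the node names, the `damping` value, and the function `random_walk_scores` are hypothetical and are not taken from STEP, whose actual graph construction and weighting may differ.

```python
from collections import defaultdict


def random_walk_scores(edges, damping=0.85, iterations=50):
    """Approximate node weights of a random walk with uniform restarts."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)  # assumption: relationships are undirected
    nodes = sorted(neighbours)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node first receives the restart mass, then its neighbours' shares.
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            share = damping * score[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        score = nxt
    return score


# Hypothetical toy graph: the noisy candidate is produced by two wrappers,
# the clean candidate by only one.
EDGES = [
    ("page1", "wrapper1"), ("wrapper1", "cand:noisy"),
    ("page2", "wrapper2"), ("wrapper2", "cand:noisy"),
    ("page3", "wrapper3"), ("wrapper3", "cand:clean"),
]

scores = random_walk_scores(EDGES)
for node in sorted(scores, key=scores.get, reverse=True):
    print(f"{node:12s} {scores[node]:.4f}")
```

In this toy graph the noisy candidate ends up with a higher weight than the clean one purely because more wrappers point to it, which is the failure mode the abstract describes.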

In this research, we propose a method that tackles this problem in STEP by reconstructing tuples. We begin by extracting binary tuples from the Web. Each binary tuple consists of a key attribute and a property of that attribute. To validate the truthfulness of the binary tuples, we apply truth-finding algorithms, which helps us build a credible list of binary tuples. We propose two methods to reconstruct tuples from binary ones. We use the reconstructed tuples to enrich STEP's graph of entities so that the “true” candidates receive more confidence and rank higher in the graph. We show that our approach is efficient and significantly improves the confidence level of the tuples extracted by STEP. We also conduct an experiment on a real-world case of populating a database relation from the Web with our proposed approach.
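As a rough illustration of the pipeline outlined above, the sketch below validates binary tuples and then joins them on their key. It makes several assumptions: plain majority voting stands in for the truth-finding algorithms, a simple join on the key attribute stands in for the two reconstruction methods (which the abstract does not detail), and the sources, data, and function names `vote` and `reconstruct` are hypothetical.

```python
from collections import defaultdict


def vote(claims):
    """For each (key, attribute), keep the value backed by the most distinct sources."""
    supporters = defaultdict(lambda: defaultdict(set))
    for source, key, attribute, value in claims:
        supporters[(key, attribute)][value].add(source)
    return {
        slot: max(values, key=lambda v: len(values[v]))
        for slot, values in supporters.items()
    }


def reconstruct(binary_tuples, attributes):
    """Join trusted binary tuples on their key into one wider tuple per key."""
    rows = defaultdict(dict)
    for (key, attribute), value in binary_tuples.items():
        rows[key][attribute] = value
    # Attributes with no trusted value are left as None.
    return {
        key: tuple([key] + [row.get(a) for a in attributes])
        for key, row in rows.items()
    }


# Hypothetical claims: (source, key, attribute, value).
CLAIMS = [
    ("site_a", "France", "capital", "Paris"),
    ("site_b", "France", "capital", "Paris"),
    ("site_c", "France", "capital", "Lyon"),
    ("site_a", "France", "currency", "Euro"),
    ("site_b", "France", "currency", "Euro"),
    ("site_a", "Japan", "capital", "Tokyo"),
    ("site_c", "Japan", "currency", "Yen"),
]

trusted = vote(CLAIMS)
tuples = reconstruct(trusted, ("capital", "currency"))
print(tuples)
```

Here the conflicting claim ("France", "capital", "Lyon") is outvoted, so the reconstructed tuple for France carries Paris as its capital.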

Notes

1.
https://www.tripadvisor.com/Restaurants-g297697-Kuta_Kuta_District_Bali.html.

Acknowledgment

This work has been partially funded by the Big Data and Market Insights Chair of Télécom ParisTech and supported by the National University of Singapore under a grant from Singapore Ministry of Education for research project number T1 251RES1607.

Author information

Authors and Affiliations

Télécom ParisTech, Paris, France
Ngurah Agus Sanjaya Er & Talel Abdessalem
Université Alioune Diop de Bambey, Bambey, Senegal
Mouhamadou Lamine Ba
National University of Singapore, Singapore, Singapore
Talel Abdessalem & Stéphane Bressan
UMI IPAL, CNRS, Paris, France
Ngurah Agus Sanjaya Er & Talel Abdessalem

Corresponding author

Correspondence to Ngurah Agus Sanjaya Er.

Editor information

Editors and Affiliations

Swinburne University of Technology, Hawthorn, VIC, Australia
Chengfei Liu
Peking University, Beijing, China
Lei Zou
University of Western Australia, Crawley, WA, Australia
Jianxin Li

About this paper

Cite this paper

Er, N.A.S., Ba, M.L., Abdessalem, T., Bressan, S. (2018). Tuple Reconstruction. In: Liu, C., Zou, L., Li, J. (eds) Database Systems for Advanced Applications. DASFAA 2018. Lecture Notes in Computer Science, vol 10829. Springer, Cham. https://doi.org/10.1007/978-3-319-91455-8_21

DOI: https://doi.org/10.1007/978-3-319-91455-8_21
Published: 12 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91454-1
Online ISBN: 978-3-319-91455-8
eBook Packages: Computer Science, Computer Science (R0)
