Abstract
The Set of Tuples Expansion system (STEP) extracts information from the Web in the form of tuples. It builds a graph of entities whose nodes are Web pages, wrappers, seeds, domains, and candidates, and whose edges are the relationships between them. The final weight assigned to each node after running random walks on the graph is used to rank the extracted candidates. Due to the nature of the regular expressions used as wrappers, some of the extracted candidates may contain “noise” and can therefore be considered “false”. These false candidates may rank higher than the “true” ones because they are extracted from many Web pages or produced by many different wrappers. Minimizing these false candidates is necessary to ensure the validity of the results presented.
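As a rough illustration of ranking candidates by random-walk weight, the sketch below runs a PageRank-style power iteration with uniform restart over a toy undirected graph. The node names, the damping factor, the iteration count, and the restart scheme are all assumptions made for this example; the abstract does not specify how STEP configures its walks.

```python
from collections import defaultdict


def random_walk_weights(edges, damping=0.85, iterations=50):
    """Approximate the weights of a random walk with uniform restart.

    `edges` is a list of undirected (node, node) pairs. This is an
    illustrative stand-in, not the weighting scheme used by STEP.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = sorted(neighbours)
    weight = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the restart mass, then shares from neighbours.
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            share = damping * weight[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        weight = nxt
    return weight


# Hypothetical toy graph: one seed, two pages, two wrappers, two candidates.
edges = [
    ("seed", "page1"), ("seed", "page2"),
    ("page1", "wrapper1"), ("page2", "wrapper2"),
    ("wrapper1", "cand:<France, Paris>"),
    ("wrapper2", "cand:<France, Paris>"),
    ("wrapper2", "cand:<France, Lyon>"),
]
weights = random_walk_weights(edges)
ranked = sorted(
    (n for n in weights if n.startswith("cand:")),
    key=weights.get,
    reverse=True,
)
print(ranked)
```

In this toy graph the candidate reached through both wrappers ranks above the one reached through a single wrapper. The same mechanism explains why a noisy candidate produced by many wrappers can outrank a correct one.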
In this work, we propose a method to tackle this problem in STEP by reconstructing tuples. We begin by extracting binary tuples from the Web. Each binary tuple consists of a key attribute and a property of that attribute. To validate the truthfulness of the binary tuples, we apply truth-finding algorithms, which helps us build a credible list of binary tuples. We propose two methods to reconstruct tuples from binary ones. We use the reconstructed tuples to enrich STEP's graph of entities so that the “true” candidates receive more confidence and rank higher in the graph. We show that our approach is efficient and significantly improves the confidence level of the tuples extracted by STEP. We also conduct an experiment on a real-world case of populating a database relation from the Web with our proposed approach.
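The pipeline described above can be sketched in a few lines, under loudly stated assumptions: simple majority voting stands in for the truth-finding algorithms, and a join on the key attribute stands in for tuple reconstruction. The abstract does not describe the two reconstruction methods the paper proposes, so neither function below should be read as the authors' method. The claims, source names, and attribute names are hypothetical.

```python
from collections import Counter, defaultdict


def vote(claims):
    """Pick one value per (key, attribute) by majority vote over sources.

    Majority voting is an assumed stand-in for a truth-finding algorithm.
    `claims` is an iterable of (key, attribute, value, source).
    """
    tally = defaultdict(Counter)
    seen = set()
    for claim in claims:
        if claim in seen:  # count each source at most once per claim
            continue
        seen.add(claim)
        key, attribute, value, _source = claim
        tally[(key, attribute)][value] += 1
    return {ka: counts.most_common(1)[0][0] for ka, counts in tally.items()}


def reconstruct(truth, attributes):
    """Join validated binary tuples on their key into n-ary tuples.

    Keys that lack a value for any requested attribute are skipped.
    """
    by_key = defaultdict(dict)
    for (key, attribute), value in truth.items():
        by_key[key][attribute] = value
    return sorted(
        (key, *(props[a] for a in attributes))
        for key, props in by_key.items()
        if all(a in props for a in attributes)
    )


# Hypothetical binary tuples: (key, attribute, value, source).
claims = [
    ("France", "capital", "Paris", "site-a"),
    ("France", "capital", "Paris", "site-b"),
    ("France", "capital", "Lyon", "site-c"),
    ("France", "currency", "Euro", "site-a"),
    ("Japan", "capital", "Tokyo", "site-a"),
    ("Japan", "capital", "Tokyo", "site-c"),
    ("Japan", "currency", "Yen", "site-b"),
    ("Peru", "capital", "Lima", "site-b"),
]
tuples = reconstruct(vote(claims), ["capital", "currency"])
print(tuples)
```

Here the conflicting claim for the capital of France is outvoted, and the key with no currency claim yields no reconstructed tuple. Whether incomplete keys should be dropped or padded is a design choice this sketch makes arbitrarily.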
Acknowledgment
This work has been partially funded by the Big Data and Market Insights Chair of Télécom ParisTech and supported by the National University of Singapore under a grant from Singapore Ministry of Education for research project number T1 251RES1607.
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Er, N.A.S., Ba, M.L., Abdessalem, T., Bressan, S. (2018). Tuple Reconstruction. In: Liu, C., Zou, L., Li, J. (eds) Database Systems for Advanced Applications. DASFAA 2018. Lecture Notes in Computer Science, vol 10829. Springer, Cham. https://doi.org/10.1007/978-3-319-91455-8_21
DOI: https://doi.org/10.1007/978-3-319-91455-8_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91454-1
Online ISBN: 978-3-319-91455-8
eBook Packages: Computer Science (R0)